Bayesian Logistic Regression - Application in Probability Prediction of Disease (Diabetes)

Capstone Project 2025

Author

Namita Mishra (Advisor: Dr. Ashraf Cohen)

Published

October 3, 2025

Slides: slides.html


Introduction

Diabetes mellitus (DM) is a major public health concern associated with obesity, age, race, and gender, and identifying the associated risk factors is crucial for targeted intervention. Logistic regression estimates the association between risk factors and a binary outcome (presence or absence of diabetes). Standard analytical approaches are insufficient for the complexity of healthcare data: DNA sequences, imaging, patient-reported outcomes, electronic health records (EHRs), longitudinal health measurements, diagnoses, and treatments (Zeger et al., 2020). Moreover, classical maximum likelihood estimation (MLE) yields unstable results in small samples, with missing data, or under quasi- or complete separation. Zeger et al. (2020) implemented a Bayesian hierarchical model with Markov chain Monte Carlo (MCMC) on multivariate longitudinal healthcare data, integrating prior knowledge to predict health status. The model has two levels of data structure: (1) repeated measures over time within individuals and (2) individuals nested within a population, with added exogenous covariates (e.g., age, clinical history) and endogenous covariates (e.g., current treatment). The posterior and marginal distributions obtained from MCMC estimation of the parameters provide risk prediction (demonstrated for pneumonia, prostate cancer, and mental disorders). The model's main limitation is its parametric nature.

Application of Bayesian inference. Chatzimichail and Hatjimihail (2023) compared parametric distributions (with a fixed set of parameters) and non-parametric distributions (which make no a priori assumptions) on National Health and Nutrition Examination Survey data from two separate diagnostic tests on diseased and non-diseased populations, obtaining posterior probabilities for disease classification. They critique conventional clinical criteria, fixed numerical thresholds, and the dichotomous method (based on the overlap of the probability distributions of the diseased and non-diseased groups) for failing to capture the intricate relationship between diagnostic tests and disease prevalence, for missing the complexity and heterogeneity of diverse populations, and for limited applicability to skewed, bimodal, or multimodal data. Bayesian non-parametric modeling (versus parametric) is a flexible, adaptable, versatile, and robust approach that captures complex data patterns, producing multimodal probability patterns rather than the bimodal, double-sigmoidal curves of parametric models. Integrating priors and combining multiple diagnostic tests improves diagnostic accuracy and precision. Applicability is limited by access to scholarly publications and over-dependence on priors. Combining the approach with other statistical and computational techniques enhances diagnostic capability (Chatzimichail and Hatjimihail, 2023) and helps overcome systematic bias and unrepresentative, incomplete, or non-normal datasets.

The Bayesian methodology review by Schoot et al. (2021) emphasizes the importance of priors, data modeling, inference, model checking, sampling from the posterior distribution, variational inference, and variable selection for applicability across the social sciences, ecology, genetics, and medicine. Variable selection is crucial because multicollinearity, insufficient sampling, and overfitting result in poor predictive performance and difficult interpretation. Informative, weakly informative, or diffuse priors are incorporated depending on the (un)certainty in the hyperparameters, with a larger variance representing greater uncertainty. Prior elicitation (from experts, generic experts, previous data, or sample data via maximum likelihood or sample statistics) and prior sensitivity analysis assess how the priors and the likelihood align. Priors provide data-informed shrinkage and regularization, concentrating estimation in a high-density region and improving it. Specifying prior information strengthens small or less informative samples: the unknown parameters take varied values, the observed data are fixed, the likelihood function generates the range of plausible values, and MCMC algorithms sample from the resulting distribution by computer simulation to give empirical estimates of the posterior (e.g., the brms and blavaan packages in R). The frequentist method, by contrast, does not assign probability to the unknown parameters and treats them as fixed, with the likelihood based on the conditional probability distribution of the data. Spatial and temporal Bayesian models have applicability to large-scale cancer genomic data: identifying novel molecular-level changes and interactions between mutated genes, capturing mutational signatures, allowing genomics-based patient stratification in clinical trials and targeted treatment, and aiding understanding of cancer evolution.
The Bayesian model is reproducible, but is limited by autocorrelation in the temporal model and by subjectivity in prior elicitation.
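The MCMC idea described above can be illustrated with a minimal random-walk Metropolis sampler in base R, here for the posterior of a simple proportion under a flat prior. The data values (y = 30 successes in n = 100 trials) and the proposal scale are illustrative only; packages such as brms automate this for full regression models.

```r
# Minimal random-walk Metropolis sketch (illustrative data, flat prior):
# posterior of a success probability p given y successes in n trials.
set.seed(42)
log_post <- function(p, y, n) {
  if (p <= 0 || p >= 1) return(-Inf)   # outside the support
  dbinom(y, n, p, log = TRUE)          # flat prior adds only a constant
}
y <- 30; n <- 100
draws <- numeric(5000)
p_cur <- 0.5
for (i in seq_along(draws)) {
  p_prop <- p_cur + rnorm(1, sd = 0.05)          # random-walk proposal
  if (log(runif(1)) < log_post(p_prop, y, n) - log_post(p_cur, y, n))
    p_cur <- p_prop                              # accept
  draws[i] <- p_cur                              # else keep current value
}
mean(draws[-(1:1000)])  # empirical posterior mean after burn-in, near y/n
```

The retained draws form an empirical estimate of the posterior distribution, which is exactly what brms returns (at much larger scale) for regression coefficients.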

Prior elicitation, analytical posteriors, and robustness checks in Bayesian normal linear regression, using a parametric (conjugate) model with a Normal-Inverse-Gamma prior, have been demonstrated in metrology to calibrate instruments (Klauenberg et al., 2015). In the Gaussian setting, the errors are independent and identically distributed with unknown variance; the relationship between X and Y is statistical, with noise and model uncertainty, so the regression cannot be treated as a measurement function. Likelihood, Bayesian, and bootstrap methods account for uncertainty and prior information. Both observables (data) and unobservables (parameters and auxiliary variables) are treated as unknown and random, and the assigned probability distributions update prior knowledge about the unobservables, enhancing interpretation and robustifying the analysis. The Normal-Inverse-Gamma (NIG) distribution, the conjugate prior for a normal likelihood with unknown mean and variance, can also specify vague or non-informative priors. Hierarchical priors add a further layer of distributions, allowing uncertainty to be modeled more flexibly.
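As a small illustration of the conjugacy described above, the following base-R sketch updates a Normal-Inverse-Gamma prior with a normal sample using the standard closed-form update. The hyperparameter defaults (mu0, kappa0, alpha0, beta0) are illustrative choices, not values from Klauenberg et al. (2015).

```r
# Conjugate NIG update for a normal sample with unknown mean and variance:
# prior  mu | sigma^2 ~ N(mu0, sigma^2/kappa0),  sigma^2 ~ IG(alpha0, beta0)
nig_update <- function(x, mu0 = 0, kappa0 = 1, alpha0 = 1, beta0 = 1) {
  n    <- length(x)
  xbar <- mean(x)
  kappa_n <- kappa0 + n
  mu_n    <- (kappa0 * mu0 + n * xbar) / kappa_n   # precision-weighted mean
  alpha_n <- alpha0 + n / 2
  beta_n  <- beta0 + 0.5 * sum((x - xbar)^2) +
             kappa0 * n * (xbar - mu0)^2 / (2 * kappa_n)
  list(mu = mu_n, kappa = kappa_n, alpha = alpha_n, beta = beta_n)
}

set.seed(1)
post <- nig_update(rnorm(50, mean = 2, sd = 1))
post$mu  # posterior mean, pulled slightly toward the prior mean mu0 = 0
```

Because the posterior is again NIG, the analytical update replaces any need for sampling in this simple case; vague priors correspond to small kappa0 and alpha0.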

The Bayesian hierarchical / meta-analytic linear regression model of DeLeeuw (2012) augments data by incorporating both exchangeable and unexchangeable information on parameters, addressing multiple testing with low statistical power, the problems of conducting separate significance tests across studies with different predictors, and the need for larger samples. Ordinary linear regression produces small, unreliable estimates that are vulnerable to sample variation. Priors derived from meta-analysis address small sample sizes and the unavailability of previous articles, resolve the limitations of univariate analyses, and capture the relationships among multiple regression parameters within a study. Priors based on previous and current data are categorized as (1) exchangeable, when the current data and previous studies share the same set of predictors, and (2) unexchangeable, when the predictors differ. The probability density function for the data (sampled with the Gibbs sampler) and the likelihood function reflect the prior assumptions of the model. The hierarchical unexchangeable model is applicable to studying differences between studies, enabling explicit testing of the exchangeability assumption. Application is limited by correlation between identical sets of predictors (DeLeeuw, 2012).

Bayesian logistic regression (Bayesian GLM) as a sequential clinical reasoning model. Liu (2013) demonstrated its applicability in screening adults (20–79 years, Taiwan), addressing the limited availability of molecular information and offering an alternative that leverages routinely collected biological markers to classify disease. Predictors were added sequentially in three models: (1) demographic features (basic model), (2) six metabolic components (metabolic score model), and (3) conventional risk factors (enhanced model). Incorporating priors and emulating a clinician's evaluation process, the model assumes normally distributed regression coefficients, accounts for uncertainty in clinical weights, and averages credible intervals for predicted risk estimates. The posterior distributions from the enhanced model showed that patient background contributed significantly to baseline risk estimation, integrating individual characteristics and capturing ecological heterogeneity. Applicability is limited by potential interactions between predictors and the lack of external cross-validation.

Bayesian multiple imputation with logistic regression (Austin, 2021) addresses missing data in clinical research. Analyzing the causes of missing values, whether (i) patients refusing to answer specific questions, (ii) loss to follow-up, (iii) investigator or mechanical errors, or (iv) physicians choosing not to order certain investigations, and understanding the missingness mechanism (missing at random, MAR; missing not at random, MNAR; or missing completely at random, MCAR) is crucial. Multiple imputation (MI), available in R, SAS, and Stata, provides plausible values to create multiple completed datasets, conducts identical statistical analyses across them, and robustifies the estimates through pooled results when classifying patient health status and mortality.
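The pooling step of multiple imputation can be sketched with Rubin's rules in base R. The coefficient estimates and standard errors below are made-up illustrative values; in practice mice::pool() performs this computation after fitting the same model on each imputed dataset.

```r
# Rubin's rules: pool one coefficient estimated on m imputed datasets
pool_rubin <- function(est, se) {
  m    <- length(est)
  qbar <- mean(est)            # pooled point estimate
  W    <- mean(se^2)           # average within-imputation variance
  B    <- var(est)             # between-imputation variance
  Tvar <- W + (1 + 1/m) * B    # total variance
  c(estimate = qbar, se = sqrt(Tvar))
}

# Illustrative coefficients and SEs from m = 5 imputed fits (made-up values)
pooled <- pool_rubin(est = c(0.42, 0.38, 0.45, 0.40, 0.43),
                     se  = c(0.11, 0.12, 0.10, 0.11, 0.12))
pooled
```

The pooled standard error exceeds the average within-imputation standard error because the between-imputation term carries the extra uncertainty due to the missing data.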

Aims

The present study focuses on the application of Bayesian logistic regression to predict diabetes status based on body mass index (BMI), age, gender, and race as predictors using a retrospective dataset (2013–2014 NHANES survey). The dataset reveals challenges such as quasi-separation, missing values, and a relatively small effective sample size, and the traditional logistic regression has limitations in dealing with these anomalies. Initial data exploration yielded 9,813 observations across five selected variables. The results from complete case analysis, listwise deletion, substantially reduced the sample size to only 14 complete cases and presented quasi-separation with implausibly large coefficients and unstable estimates. The analytic limitations of traditional logistic regression motivate us to perform Multiple Imputation by Chained Equations (MICE) in conjunction with Bayesian logistic regression. The approach could provide a flexible framework for modeling uncertainty, incorporating prior knowledge, and addressing issues related to quasi-separation and limited sample size.

Method and Data Preparation

Statistical tools. R, with the packages and libraries listed below, is used to import, manage, and analyze the data. Data come from the NHANES 2-year cross-sectional cycle (2013–2014), using three datasets (demographics, exam, questionnaire) (National Center for Health Statistics, 1999). The haven package converted the .XPT files into an R dataframe (df).

Code
# loading packages (nhanesA and gt installed from CRAN if needed)
options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("nhanesA")
library(nhanesA)

library(tidyverse)   # includes dplyr, tidyr, forcats, ggplot2
library(knitr)
library(ggthemes)
library(ggrepel)
library(dslabs)
library(Hmisc)

library(classpackage)
library(janitor)
install.packages("gt")
library(gt)
library(survey)
library(DataExplorer)
library(logistf)

Data Management

Subsets created from the three original weighted datasets (demographics, exam, questionnaire) were merged by ID into a single dataframe containing the selected variables of interest.

  1. Response variable (Diabetes; binary) was defined from the questionnaire item "Is there one Dr you see for diabetes".
  2. Predictor variable (BMI; factor, 4 levels): the original data provide BMDBMIC (measured BMI category) as categorical, with the following 4 levels:
    o Underweight (<5th percentile)
    o Normal (5th–<85th percentile)
    o Overweight (85th–<95th percentile)
    o Obese (≥95th percentile)
    We kept the variable as-is, since the categorization provides clinically interpretable groups.
  3. Covariates:
  • Gender (factor, 2 levels): Male, Female
  • Ethnicity (factor, 5 levels): Mexican American, Other Hispanic, Non-Hispanic White, Non-Hispanic Black, Other Race - Including Multi-Racial
  • Age (numeric, continuous)
Code
library(gt)
# formation of table with variable details

variables <- c("ID", "BMI", "Age", "Gender", "Race", "PSU", "Strata", "Weight", "Diabetes")
descriptions <- c("Respondent sequence number", 
        "BMI calculated as weight in kilograms divided by height in meters squared, and then rounded to one decimal place.", 
        "Age in years of the participant at the time of screening. Individuals 80 and over are topcoded at 80 years of Age.", 
        "Gender", 
        "Race/ethnicity Recode of reported race and Hispanic origin information", 
        "Sample PSU", 
        "Sample strata", 
        "MEC exam weight", 
        "Diabetes status Is there one doctor or other health professional {you usually see/SP usually sees} for {your/his/her} diabetes? Do not include specialists to whom {you have/SP has} been referred such as diabetes educators, dieticians or foot and eye doctors.")
df <- data.frame(Variable = variables, Description = descriptions)
df %>%
  gt %>%
  tab_header(
    title = "Table Variable Description"
  ) %>%
  tab_footnote(
    footnote = "Each variable in the dataset, accompanied by a qualitative description from the study team."
  )
Table Variable Description
Variable Description
ID Respondent sequence number
BMI BMI calculated as weight in kilograms divided by height in meters squared, and then rounded to one decimal place.
Age Age in years of the participant at the time of screening. Individuals 80 and over are topcoded at 80 years of Age.
Gender Gender
Race Race/ethnicity Recode of reported race and Hispanic origin information
PSU Sample PSU
Strata Sample strata
Weight MEC exam weight
Diabetes Diabetes status Is there one doctor or other health professional {you usually see/SP usually sees} for {your/his/her} diabetes? Do not include specialists to whom {you have/SP has} been referred such as diabetes educators, dieticians or foot and eye doctors.
Each variable in the dataset, accompanied by a qualitative description from the study team.

The dataframe structure, the missing-value counts, and the plots of the breakdown of missingness reveal that only 14 rows are complete cases (no NAs in any variable).

Code
# structure of the merged dataframe
str(merged_data)
'data.frame':   9813 obs. of  9 variables:
 $ ID      : num  73557 73558 73559 73560 73561 ...
 $ BMI     : Factor w/ 4 levels "Underweight",..: NA NA NA 2 NA NA NA NA NA NA ...
 $ Age     : num  69 54 72 9 73 56 0 61 56 65 ...
 $ Gender  : Factor w/ 2 levels "Male","Female": 1 1 1 1 2 1 1 2 2 1 ...
 $ Race    : Factor w/ 5 levels "Mexican American",..: 4 3 3 3 3 1 3 3 3 3 ...
 $ PSU     : num  1 1 1 2 2 1 1 1 1 2 ...
 $ Strata  : num  112 108 109 109 116 111 105 114 112 112 ...
 $ Weight  : num  13481 24472 57193 55767 65542 ...
 $ Diabetes: Factor w/ 2 levels "Yes","No": 1 1 2 NA NA NA NA NA NA NA ...
Code
plot_str(merged_data)
introduce(merged_data)
  rows columns discrete_columns continuous_columns all_missing_columns
1 9813       9                4                  5                   0
  total_missing_values complete_rows total_observations memory_usage
1                15381            14              88317       554872
Code
plot_intro(merged_data, title="Figure 1 (Merged dataset). Structure of variables and missing observations.")

Code
plot_missing(merged_data, title="Figure 2(Merged dataset). Breakdown of missing observations.")

Code
# Correct survey design
nhanes_design <- svydesign(
  id = ~PSU,        # primary sampling unit
  strata = ~Strata, # stratification variable
  weights = ~Weight,# survey weights
  data = merged_data,
  nest = TRUE
)

# Weighted proportion of Diabetes
svymean(~Diabetes, design = nhanes_design, na.rm = TRUE)
               mean    SE
DiabetesYes 0.78759 0.021
DiabetesNo  0.21241 0.021
Code
svymean(~Age , design = nhanes_design, na.rm = TRUE)
      mean     SE
Age 37.504 0.4412
Code
svymean(~BMI, design = nhanes_design, na.rm = TRUE)
                     mean     SE
BMIUnderweight   0.037703 0.0037
BMINormal weight 0.628530 0.0120
BMIOverweight    0.162217 0.0056
BMIObese         0.171551 0.0109
Code
svymean(~Gender, design = nhanes_design, na.rm = TRUE)
                mean     SE
GenderMale   0.48861 0.0065
GenderFemale 0.51139 0.0065
Code
svymean(~Race, design = nhanes_design, na.rm = TRUE)
                                            mean     SE
RaceMexican American                    0.110450 0.0199
RaceOther Hispanic                      0.060637 0.0106
RaceNon-Hispanic White                  0.622315 0.0365
RaceNon-Hispanic Black                  0.120895 0.0173
RaceOther Race - Including Multi-Racial 0.085704 0.0076

Data preprocessing and cleaning steps

  • Using library(survey), weighted means and standard errors of the variables were extracted.
  • Data variables were categorized and recoded.
  • Special codes are not random and cannot simply be dropped; ignoring informative missingness (MAR or MNAR) could introduce bias, so special codes were converted to NAs based on the variable codebook.
  • All rows, including those with NAs, were retained in the dataset; note that regression fits such as lm() exclude NAs automatically (row-wise deletion).
Code
library(dplyr)
library(knitr)

# 1. continuous variable summary
cont_summary <- merged_data %>%
  summarise(
    Mean = round(mean(Age, na.rm = TRUE), 2),
    SD   = round(sd(Age, na.rm = TRUE), 2),
    Min  = min(Age, na.rm = TRUE),
    Max  = max(Age, na.rm = TRUE)
  ) %>%
  mutate(
    Variable = "Age",
    Category = "Continuous",
    Count = nrow(merged_data) - sum(is.na(merged_data$Age)),
    Proportion = NA
  ) %>%
  select(Variable, Category, Count, Proportion, Mean, SD, Min, Max)

# 2. categorical variables
cat_summary <- function(df, var, name) {
  df %>%
    count({{var}}) %>%
    mutate(
      Proportion = round(n / sum(n), 3),
      Variable = name,
      Mean = NA, SD = NA, Min = NA, Max = NA
    ) %>%
    rename(Category = {{var}}, Count = n) %>%
    select(Variable, Category, Count, Proportion, Mean, SD, Min, Max)
}

# 3. summaries for categorical variables
gender_summary <- cat_summary(merged_data, Gender, "Gender")
BMI_summary    <- cat_summary(merged_data, BMI, "BMI Category")
race_summary   <- cat_summary(merged_data, Race, "Race")
diab_summary   <- cat_summary(merged_data, Diabetes, "Diabetes")

# 4. combine all into one table
final_table <- bind_rows(
  cont_summary,
  gender_summary,
  BMI_summary,
  race_summary,
  diab_summary
)

# 5. display as a table
kable(final_table, caption = "Table 1. Descriptive Statistics of Study Variables")
Table 1. Descriptive Statistics of Study Variables
Variable Category Count Proportion Mean SD Min Max
Age Continuous 9813 NA 31.63 24.4 0 80
Gender Male 4831 0.492 NA NA NA NA
Gender Female 4982 0.508 NA NA NA NA
BMI Category Underweight 132 0.013 NA NA NA NA
BMI Category Normal weight 2167 0.221 NA NA NA NA
BMI Category Overweight 595 0.061 NA NA NA NA
BMI Category Obese 629 0.064 NA NA NA NA
BMI Category NA 6290 0.641 NA NA NA NA
Race Mexican American 1685 0.172 NA NA NA NA
Race Other Hispanic 930 0.095 NA NA NA NA
Race Non-Hispanic White 3538 0.361 NA NA NA NA
Race Non-Hispanic Black 2198 0.224 NA NA NA NA
Race Other Race - Including Multi-Racial 1462 0.149 NA NA NA NA
Diabetes Yes 553 0.056 NA NA NA NA
Diabetes No 169 0.017 NA NA NA NA
Diabetes NA 9091 0.926 NA NA NA NA
Code
library(gt)

# Cross-tabulation: Diabetes vs BMI
tab1 <- table(merged_data$Diabetes, merged_data$BMI, useNA = "ifany")
prop.table(tab1) * 100  # overall percentages
      
       Underweight Normal weight  Overweight       Obese        <NA>
  Yes   0.00000000    0.05095282  0.03057169  0.02038113  5.53347600
  No    0.00000000    0.02038113  0.02038113  0.00000000  1.68144298
  <NA>  1.34515439   22.01161724  6.01243249  6.38948334 56.88372567
Code
# Cross-tabulation: Race vs Diabetes
tab2 <- table(merged_data$Race, merged_data$Diabetes, useNA = "ifany")
prop.table(tab2) * 100  # overall percentages
                                     
                                             Yes         No       <NA>
  Mexican American                     0.8763885  0.4178131 15.8768980
  Other Hispanic                       0.4891470  0.1528585  8.8352186
  Non-Hispanic White                   2.0992561  0.5706716 33.3842862
  Non-Hispanic Black                   1.4878223  0.3668603 20.5441761
  Other Race - Including Multi-Racial  0.6827678  0.2140018 14.0018343
Code
# Cross-tabulation: Gender vs Diabetes
tab3 <- table(merged_data$Gender, merged_data$Diabetes, useNA = "ifany")
prop.table(tab3) * 100  # overall percentages
        
                Yes         No       <NA>
  Male    2.6903088  0.8865790 45.6537247
  Female  2.9450729  0.8356262 46.9886885
Code
## Bar plot of Age, gender, race, diabetes status, BMI ## 

plot_bar(merged_data, title = "Figure 3(Merged dataset). Frequency plots of categorical variables.")

Method

The study first applied frequentist methods (a multiple logistic regression model, a baseline regression model, and Firth penalized regression) to the NHANES dataset to predict diabetes as a function of BMI, age, race, and gender, and then compared the multiple logistic regression results with the Bayesian model.

Bayesian Logistic Regression

  • Bayesian logistic regression was selected because the study's response variable is a binary outcome (Diabetes: yes/no).
  • Bayesian logistic regression is based on the binomial probability model and Bayes' rule.
  • The model relates a linear function of the predictors (Age, Race, BMI, Gender) to the response variable (Diabetes), assuming independent observations.
  • A discrete variable taking only the two values 0 or 1 follows a Bernoulli probability model, used here to classify the categorical response, i.e., to predict diabetes.
  • The logit link maps the linear predictor to a probability for the response variable.
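The role of the logit link can be made concrete in a few lines of base R: the inverse logit maps any linear predictor onto a Bernoulli success probability in (0, 1). The coefficient values and the age grid below are illustrative only, not estimates from our model.

```r
# Logit link: linear predictor -> probability in (0, 1)
inv_logit <- function(eta) 1 / (1 + exp(-eta))

# Illustrative linear predictor: intercept + age effect (assumed values)
beta0    <- -4
beta_age <- 0.05
age      <- c(20, 40, 60, 80)
p        <- inv_logit(beta0 + beta_age * age)  # P(Diabetes = Yes | age)
round(p, 3)  # probabilities increase monotonically with age
```

In the Bayesian model the coefficients beta0 and beta_age get posterior distributions rather than point values, so p inherits a full predictive distribution.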

Data exploration of the raw dataset

Before running the Bayesian regression, basic statistics, summaries, anomalies, and patterns were reported: descriptive statistics of each variable (counts, frequencies, proportions, means, and standard deviations) and the proportions of the variables were calculated, with visualization via frequency plots for continuous and categorical variables.

  1. Multiple logistic regression on the raw dataset
  • resulted in a small analyzable sample (n = 14) after listwise deletion of NAs
  • exhibited quasi-separation (Van Buuren, 2012)
  • the data revealed a breakdown of missingness into missing at random and missing not at random (MAR and MNAR), a commonly reported issue in healthcare and public health datasets; the plot above (Figure 2) depicts this breakdown
  2. Baseline regression model (BMI as the only predictor)
  • conducted to compare whether the predictors significantly improve predictive power
  • the regression produced only a small drop in deviance, indicating that the BMI category adds very little predictive value over simply assuming the overall diabetes prevalence
  • null deviance = 16.75 (baseline fit)
  • residual deviance = 15.11 (with BMI)
  3. Firth (penalized) regression
  • a frequentist approach considered to handle quasi-separation (D'Angelo and Ran, 2025); it uses the Jeffreys prior for bias correction but, unlike Bayesian logistic regression, provides no posterior distribution and no MCMC sampling

Unexpected reports, patterns or anomalies in the raw data

  • Quasi-complete separation (9,799 observations dropped).
  • Reduced sample size with very few complete cases (n = 14).
  • The model is overfitted to this subset and cannot be generalized.
  • Huge coefficients (e.g., 94, -50, 73) and the tiny residual deviance suggest perfect separation and sparse data, with very few observations in some categories producing imbalance in the outcome (very few cases of 0 or 1).
  • Logistic regression cannot estimate stable coefficients when predictors perfectly classify the outcome.
  • Firth regression kept the coefficients finite under quasi-separation, but with the reduced sample size (n = 14) the estimates remain highly uncertain, with wide confidence intervals, so no strong claims about predictor effects can be made.
  • Multivariate, non-monotone (arbitrary) missingness, with a connected pattern observed across all variables in some cases. This matters because estimating a correlation coefficient between two variables requires them to be connected, either directly through a set of cases with scores on both, or indirectly through their relation with a third, connected set of data.
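The separation behavior listed above can be reproduced on a toy dataset in which a single predictor perfectly splits the outcome. This is a sketch with made-up data, not our NHANES sample, but glm() shows the same symptom: an implausibly large, unstable coefficient.

```r
# Toy example of complete separation: x perfectly classifies y
toy <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                  y = c(0, 0, 0, 1, 1, 1))

# glm warns about convergence and fitted probabilities of 0 or 1;
# the slope estimate diverges instead of settling at a finite value
fit <- suppressWarnings(glm(y ~ x, data = toy, family = binomial))
coef(fit)[["x"]]  # implausibly large slope, as in the n = 14 fits above
```

Firth penalization or a Bayesian prior shrinks such a slope back to a finite, interpretable value, which is the motivation for the approaches compared in this study.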

Model interpretations: Only 14 non-missing cases were available, so the estimates cannot be trusted (small sample plus quasi-separation). The models with all predictors together, like the baseline model, yielded unstable and extreme estimates with meaningless standard errors. Adding more predictors makes the deviance drop, but this reflects overfitting/separation rather than true explanatory power. The BMI-only model contributes very little to explaining the response variable. Race and gender make the models appear stronger, but with the small sample (n = 14) and complete separation, the results cannot be generalized.

Considering these data anomalies, we decided to retain the full N = 9,813 and to address the small complete-case sample and quasi-separation by conducting Multivariate Imputation by Chained Equations (MICE).

  1. Multivariate Imputation by Chained Equations (MICE)
  • A Bayesian approach (van Buuren and Groothuis-Oudshoorn, 2011).
  • Multiple imputation (MI) is an alternative Bayesian approach for small datasets.
  • Flatness of the density, heavy tails, non-zero peakedness, skewness, and multimodality do not hamper the good performance of multiple imputation for the mean structure in samples of n > 400, even with high percentages (75%) of missing data in one variable (Van Buuren, 2012).
  • MI (via the popular mice package in R) adds sampling variability to the imputations.
  • Iterative imputation (MICE) imputes the missing values of one variable at a time, using regression models based on the other variables in the dataset.
  • This is a chained process: each imputed variable becomes a predictor for the subsequent imputation, and the entire process is repeated multiple times to create several complete datasets, each reflecting different possibilities for the missing data.
  • Each variable is imputed using its own appropriate univariate regression model.
Code
## Multiple Imputation performed 
# Subset variables for imputation in analytic_data df
library(dplyr)
library(ggplot2)
library(mice)
library(VIM)
library(janitor)

# 1. Select variables for imputation
vars <- c("ID", "BMI", "Age", "Gender", "Race", "PSU", "Strata", "Weight", "Diabetes")
analytic_data <- merged_data[, vars]

glimpse(merged_data)
Rows: 9,813
Columns: 9
$ ID       <dbl> 73557, 73558, 73559, 73560, 73561, 73562, 73563, 73564, 73566…
$ BMI      <fct> NA, NA, NA, Normal weight, NA, NA, NA, NA, NA, NA, NA, Normal…
$ Age      <dbl> 69, 54, 72, 9, 73, 56, 0, 61, 56, 65, 26, 9, 76, 10, 10, 33, …
$ Gender   <fct> Male, Male, Male, Male, Female, Male, Male, Female, Female, M…
$ Race     <fct> Non-Hispanic Black, Non-Hispanic White, Non-Hispanic White, N…
$ PSU      <dbl> 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 1…
$ Strata   <dbl> 112, 108, 109, 109, 116, 111, 105, 114, 112, 112, 113, 115, 1…
$ Weight   <dbl> 13481.042, 24471.770, 57193.285, 55766.512, 65541.871, 25344.…
$ Diabetes <fct> Yes, Yes, No, NA, NA, NA, NA, NA, NA, NA, NA, NA, No, NA, NA,…
Code
# 2. Run mice to create 5 imputed datasets
imputed_data <- mice(
  analytic_data,
  m = 5,              # number of imputed datasets
  method = 'pmm',     # predictive mean matching
  seed = 123
)

 iter imp variable
  1   1  BMI  Diabetes
  1   2  BMI  Diabetes
  1   3  BMI  Diabetes
  1   4  BMI  Diabetes
  1   5  BMI  Diabetes
  2   1  BMI  Diabetes
  2   2  BMI  Diabetes
  2   3  BMI  Diabetes
  2   4  BMI  Diabetes
  2   5  BMI  Diabetes
  3   1  BMI  Diabetes
  3   2  BMI  Diabetes
  3   3  BMI  Diabetes
  3   4  BMI  Diabetes
  3   5  BMI  Diabetes
  4   1  BMI  Diabetes
  4   2  BMI  Diabetes
  4   3  BMI  Diabetes
  4   4  BMI  Diabetes
  4   5  BMI  Diabetes
  5   1  BMI  Diabetes
  5   2  BMI  Diabetes
  5   3  BMI  Diabetes
  5   4  BMI  Diabetes
  5   5  BMI  Diabetes
Code
# 3. First imputed dataset
Imputed_data1 <- complete(imputed_data, 1)

# 4. Check missingness
str(Imputed_data1)
'data.frame':   9813 obs. of  9 variables:
 $ ID      : num  73557 73558 73559 73560 73561 ...
 $ BMI     : Factor w/ 4 levels "Underweight",..: 4 2 3 2 3 2 2 3 4 2 ...
 $ Age     : num  69 54 72 9 73 56 0 61 56 65 ...
 $ Gender  : Factor w/ 2 levels "Male","Female": 1 1 1 1 2 1 1 2 2 1 ...
 $ Race    : Factor w/ 5 levels "Mexican American",..: 4 3 3 3 3 1 3 3 3 3 ...
 $ PSU     : num  1 1 1 2 2 1 1 1 1 2 ...
 $ Strata  : num  112 108 109 109 116 111 105 114 112 112 ...
 $ Weight  : num  13481 24472 57193 55767 65542 ...
 $ Diabetes: Factor w/ 2 levels "Yes","No": 1 1 2 1 1 2 1 1 1 2 ...
Code
summary(Imputed_data1)
       ID                   BMI            Age           Gender    
 Min.   :73557   Underweight  : 218   Min.   : 0.00   Male  :4831  
 1st Qu.:76092   Normal weight:5107   1st Qu.:10.00   Female:4982  
 Median :78643   Overweight   :1831   Median :27.00                
 Mean   :78645   Obese        :2657   Mean   :31.63                
 3rd Qu.:81191                        3rd Qu.:52.00                
 Max.   :83731                        Max.   :80.00                
                                  Race           PSU            Strata     
 Mexican American                   :1685   Min.   :1.000   Min.   :104.0  
 Other Hispanic                     : 930   1st Qu.:1.000   1st Qu.:107.0  
 Non-Hispanic White                 :3538   Median :1.000   Median :111.0  
 Non-Hispanic Black                 :2198   Mean   :1.483   Mean   :110.9  
 Other Race - Including Multi-Racial:1462   3rd Qu.:2.000   3rd Qu.:115.0  
                                            Max.   :2.000   Max.   :118.0  
     Weight       Diabetes  
 Min.   :  3748   Yes:7252  
 1st Qu.: 13187   No :2561  
 Median : 20965             
 Mean   : 31713             
 3rd Qu.: 37922             
 Max.   :171395             
Code
colSums(is.na(Imputed_data1))
      ID      BMI      Age   Gender     Race      PSU   Strata   Weight 
       0        0        0        0        0        0        0        0 
Diabetes 
       0 
Code
plot_intro(Imputed_data1, title="Figure 8. Structure of variables and missing observations.")

Code
plot_bar(Imputed_data1, title = "Figure 9. Frequency plots of categorical variables.")

Code
plot_correlation(na.omit(Imputed_data1[, c("BMI", "Diabetes")]), maxcat=5L, title = "Figure")

Code
# 5. Cross-tabulation
tab <- table(Imputed_data1$BMI, Imputed_data1$Diabetes)
print(tab)
               
                 Yes   No
  Underweight    159   59
  Normal weight 3760 1347
  Overweight    1342  489
  Obese         1991  666
Code
# Chi-square test
chisq.test(tab)

    Pearson's Chi-squared test

data:  tab
X-squared = 2.1289, df = 3, p-value = 0.5461
Code
# Cross-tabulation

# BMI vs Diabetes
tab_BMI <- table(Imputed_data1$BMI, Imputed_data1$Diabetes)
print(tab_BMI)
               
                 Yes   No
  Underweight    159   59
  Normal weight 3760 1347
  Overweight    1342  489
  Obese         1991  666
Code
prop.table(tab_BMI, 1) * 100  # row percentages
               
                     Yes       No
  Underweight   72.93578 27.06422
  Normal weight 73.62444 26.37556
  Overweight    73.29328 26.70672
  Obese         74.93414 25.06586
Code
# Gender vs Diabetes
tab_gender <- table(Imputed_data1$Gender, Imputed_data1$Diabetes)
prop.table(tab_gender, 1) * 100
        
              Yes       No
  Male   72.92486 27.07514
  Female 74.84946 25.15054
Code
# Race vs Diabetes
tab_race <- table(Imputed_data1$Race, Imputed_data1$Diabetes)
prop.table(tab_race, 1) * 100
                                     
                                           Yes       No
  Mexican American                    71.03858 28.96142
  Other Hispanic                      74.40860 25.59140
  Non-Hispanic White                  76.17298 23.82702
  Non-Hispanic Black                  72.47498 27.52502
  Other Race - Including Multi-Racial 73.52941 26.47059
Code
# Age vs Diabetes
tab_age <- table(Imputed_data1$Age, Imputed_data1$Diabetes)
head(prop.table(tab_age, 1) * 100)
   
         Yes       No
  0 74.93606 25.06394
  1 74.89879 25.10121
  2 72.51908 27.48092
  3 74.64115 25.35885
  4 73.48837 26.51163
  5 67.17172 32.82828
Code
# Breakdown of Diabetes within BMI
breakdown_BMI <- Imputed_data1 %>%
  group_by(BMI, Diabetes) %>%
  summarise(Count = n(), .groups = "drop") %>%
  group_by(BMI) %>%
  mutate(
    Percent = round(100 * Count / sum(Count), 1)
  )
breakdown_BMI
# A tibble: 8 × 4
# Groups:   BMI [4]
  BMI           Diabetes Count Percent
  <fct>         <fct>    <int>   <dbl>
1 Underweight   Yes        159    72.9
2 Underweight   No          59    27.1
3 Normal weight Yes       3760    73.6
4 Normal weight No        1347    26.4
5 Overweight    Yes       1342    73.3
6 Overweight    No         489    26.7
7 Obese         Yes       1991    74.9
8 Obese         No         666    25.1
Code
# 6. Frequency tables for categorical variables
categorical_vars <- c("BMI", "Gender", "Race", "Diabetes")

for (var in categorical_vars) {
  cat("\nFrequency table for", var, ":\n")
  print(table(Imputed_data1[[var]]))
  print(round(prop.table(table(Imputed_data1[[var]])), 3))
}

Frequency table for BMI :

  Underweight Normal weight    Overweight         Obese 
          218          5107          1831          2657 

  Underweight Normal weight    Overweight         Obese 
        0.022         0.520         0.187         0.271 

Frequency table for Gender :

  Male Female 
  4831   4982 

  Male Female 
 0.492  0.508 

Frequency table for Race :

                   Mexican American                      Other Hispanic 
                               1685                                 930 
                 Non-Hispanic White                  Non-Hispanic Black 
                               3538                                2198 
Other Race - Including Multi-Racial 
                               1462 

                   Mexican American                      Other Hispanic 
                              0.172                               0.095 
                 Non-Hispanic White                  Non-Hispanic Black 
                              0.361                               0.224 
Other Race - Including Multi-Racial 
                              0.149 

Frequency table for Diabetes :

 Yes   No 
7252 2561 

  Yes    No 
0.739 0.261 
Code
# 7. Summary statistics for continuous variables
continuous_vars <- c("Age")

for (var in continuous_vars) {
  cat("\nSummary statistics for", var, ":\n")
  print(summary(Imputed_data1[[var]]))
  print(paste("SD:", round(sd(Imputed_data1[[var]]), 2)))
}

Summary statistics for Age :
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   10.00   27.00   31.63   52.00   80.00 
[1] "SD: 24.4"
Code
# 8. Bar plots for categorical variables
for (var in categorical_vars) {
  p <- ggplot(Imputed_data1, aes(x = .data[[var]])) +  # .data pronoun replaces deprecated aes_string()
    geom_bar(fill = "steelblue") +
    labs(title = paste("Bar plot of", var), y = "Count") +
    theme_minimal()
  print(p)
}

Code
# 9. Histograms for continuous variables
for (var in continuous_vars) {
  p <- ggplot(Imputed_data1, aes(x = .data[[var]])) +  # .data pronoun replaces deprecated aes_string()
    geom_histogram(fill = "skyblue", color = "black", bins = 30) +
    labs(title = paste("Histogram of", var), y = "Frequency") +
    theme_minimal()
  print(p)
}

Code
# 10. Scatter plot example (BMI vs Age)
ggplot(Imputed_data1, aes(x = Age, y = BMI, color = BMI)) +
  geom_point(alpha = 0.6) +
  labs(title = "Scatter plot of BMI vs Age", y = "BMI", x = "Age") +
  theme_minimal()

Code
# 11. Relative breakdown of Diabetes by BMI
ggplot(breakdown_BMI, aes(x = BMI, y = Percent, fill = Diabetes)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Relative Breakdown of Diabetes by BMI Category",
    x = "BMI Category", y = "Proportion"
  ) +
  theme_minimal()

Code
# 12. Crosstab with percentages
Imputed_data1 %>% 
  tabyl(Diabetes, BMI) %>% 
  adorn_percentages("col")
 Diabetes Underweight Normal weight Overweight     Obese
      Yes   0.7293578     0.7362444  0.7329328 0.7493414
       No   0.2706422     0.2637556  0.2670672 0.2506586
Code
# 13. Margin plot for BMI vs Diabetes
marginplot(
  Imputed_data1[, c("BMI", "Diabetes")],
  col = mdc(1:2, trans = FALSE),
  cex = 1.2,
  cex.lab = 1.2,
  cex.numbers = 1.3,
  pch = 19
)

Results from MICE:

  • MI produced 9813 complete observations with no NAs.
  • A bar plot of missingness is presented below.
  • A correlation heatmap between BMI and diabetes status showed no strong linear association between the variables.
  • The chi-square p-value of 0.5461 (> 0.05) provides no evidence of an association between BMI category and diabetes status.
  • The marginal-plot check shows that the distribution of imputed points is consistent with the observed data.
  • No unusual outliers were found.

Modeling

Bayesian Logistic Regression model

\[ \text{logit}(P(Y_i=1)) = \beta_0 + \beta_1 \cdot Age_i + \beta_2 \cdot BMI_i + \beta_3 \cdot Race_i + \beta_4 \cdot Gender_i \]

Linear Regression equation:

\[ y_i = \beta_0 + \beta_1 X_1 +\varepsilon_i \]
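The logit link above can be inverted to recover a probability from the linear predictor. A quick base-R illustration, using made-up coefficient values for illustration only (not fitted estimates from this model):

```r
# Hypothetical coefficient values, for illustration only (not fitted estimates)
b0 <- -0.76
b_age <- -0.005

age <- 50
log_odds <- b0 + b_age * age   # linear predictor on the logit scale
p <- plogis(log_odds)          # inverse logit: exp(x) / (1 + exp(x))
round(p, 3)
```

`plogis()` is base R's cumulative logistic distribution function, i.e. the inverse-logit transform used throughout this report.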

Comparison of multiple imputation and Bayesian data augmentation

Multiple imputation (frequentist approach):

  • requires no priors and has moderate flexibility

  • handles missing values first by imputation, then performs the regression analysis and pools the results

  • is performed on the data with missingness

  • propagates uncertainty only after analysis, at the pooling step

Bayesian data augmentation:

  • performs missing-data imputation and regression model fitting simultaneously

  • uses Markov Chain Monte Carlo (MCMC) to draw samples from the joint posterior of the regression parameters and the missing values, providing complete datasets along with posterior means, credible intervals, and probabilities

  • shrinks extreme estimates back toward plausible values

  • fully propagates uncertainty in the missing values through the model, and naturally handles small or sparse datasets and separation problems
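The impute-fit-pool workflow of multiple imputation can be sketched with the mice package; a minimal sketch, assuming `merged_data` is the raw dataset with missing values used elsewhere in this report:

```r
# Minimal sketch of the impute -> fit -> pool MI workflow, assuming
# `merged_data` is the raw dataset with missingness from this report
library(mice)

imp  <- mice(merged_data, m = 5, seed = 123, printFlag = FALSE)  # 5 imputed datasets
fits <- with(imp, glm(Diabetes ~ Age + Gender + Race + BMI,
                      family = binomial))                        # refit on each dataset
pooled <- pool(fits)   # combine estimates and SEs via Rubin's rules
summary(pooled)
```

`pool()` implements the after-analysis uncertainty propagation described above, which contrasts with the joint-posterior approach of Bayesian data augmentation.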

Diagnostics performed before regression analysis

  1. Modeling assumption check
  2. Correlation matrix, Cook’s distance, influential points
  3. Model plot
Code
# MLR on imputed data
# assumption check (# for correlation # )
# correlation matrix
# Frequentist logistic regression on imputed data

m_imp <- glm(Diabetes ~ Age + Gender + Race + BMI,
             data = Imputed_data1,
             family = binomial)
summary(m_imp)

Call:
glm(formula = Diabetes ~ Age + Gender + Race + BMI, family = binomial, 
    data = Imputed_data1)

Coefficients:
                                         Estimate Std. Error z value Pr(>|z|)
(Intercept)                             -0.761926   0.162425  -4.691 2.72e-06
Age                                     -0.004728   0.001011  -4.678 2.89e-06
GenderFemale                            -0.092730   0.046133  -2.010  0.04442
RaceOther Hispanic                      -0.154250   0.092549  -1.667  0.09558
RaceNon-Hispanic White                  -0.212663   0.067746  -3.139  0.00169
RaceNon-Hispanic Black                  -0.055537   0.072128  -0.770  0.44131
RaceOther Race - Including Multi-Racial -0.109388   0.080345  -1.361  0.17336
BMINormal weight                         0.024339   0.156361   0.156  0.87630
BMIOverweight                            0.071544   0.162658   0.440  0.66005
BMIObese                                 0.020287   0.161137   0.126  0.89981
                                           
(Intercept)                             ***
Age                                     ***
GenderFemale                            *  
RaceOther Hispanic                      .  
RaceNon-Hispanic White                  ** 
RaceNon-Hispanic Black                     
RaceOther Race - Including Multi-Racial    
BMINormal weight                           
BMIOverweight                              
BMIObese                                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 11267  on 9812  degrees of freedom
Residual deviance: 11219  on 9803  degrees of freedom
AIC: 11239

Number of Fisher Scoring iterations: 4
Code
coef(m_imp)
                            (Intercept)                                     Age 
                           -0.761925862                            -0.004727998 
                           GenderFemale                      RaceOther Hispanic 
                           -0.092730385                            -0.154250348 
                 RaceNon-Hispanic White                  RaceNon-Hispanic Black 
                           -0.212662854                            -0.055536906 
RaceOther Race - Including Multi-Racial                        BMINormal weight 
                           -0.109388375                             0.024339471 
                          BMIOverweight                                BMIObese 
                            0.071543638                             0.020286793 
Code
confint(m_imp)
                                               2.5 %       97.5 %
(Intercept)                             -1.086867141 -0.449109500
Age                                     -0.006713516 -0.002751496
GenderFemale                            -0.183172208 -0.002320639
RaceOther Hispanic                      -0.336602276  0.026303770
RaceNon-Hispanic White                  -0.345137536 -0.079535197
RaceNon-Hispanic Black                  -0.196779898  0.086003952
RaceOther Race - Including Multi-Racial -0.267119094  0.047893182
BMINormal weight                        -0.276189156  0.337871931
BMIOverweight                           -0.241855094  0.396816724
BMIObese                                -0.289985572  0.342735874
Code
# Log-odds (link)
Imputed_data1$logit <- predict(m_imp, type = "link") ## log (Odds) 

# Probability (inverse logit of the linear predictor)
Imputed_data1$prob <- exp(Imputed_data1$logit) / (1 + exp(Imputed_data1$logit)) # equivalently plogis(logit)

# Plot predicted probability vs Age
ggplot(Imputed_data1, aes(x = Age, y = prob)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess") +
  labs(x = "Age", y = "Predicted Probability of Diabetes")

Findings from regression of the MI data set

  1. Logistic regression on imputed data (frequentist approach)
  • The relationship between Age and the log-odds of diabetes is roughly, though not perfectly, linear, which is acceptable for the logistic regression assumptions.

  • The generalized variance inflation factor (adjusted GVIF between ~1.0 and 1.04) indicates no collinearity between predictors, so the model can be run without removing, dropping, or combining variables.

  • From Cook's distance and influence diagnostics, most data points are safe and do not influence the model. With ~9813 cases, the cutoff is ≈ 0.0004. A cluster at high leverage shows unusual predictor values but not high influence; a few points above the Cook's distance cutoff are worth checking individually, but pose no major threat to model stability. No outliers were detected.

  • The Hosmer–Lemeshow (H–L) test at alpha = 0.05 gives p < 0.001, indicating that the logistic regression model does not fit the data well.

  • The graph shows residuals vs fitted values for the imputed-data model.

  2. Results from Bayesian data augmentation and logistic regression
  • We incorporate prior knowledge that higher BMI increases the odds of diabetes.
  • We specify priors for the Bayesian logistic regression and compare models fitted with different priors:
    • Prior (intercept): taken from this study's data.
    • Priors (coefficients): BMI, Age, Gender, Race.
      • A common approach is a normal prior on each regression coefficient, e.g. βBMI ∼ N(μ, σ²).
      • Weakly informative prior: β ∼ N(0, 2.5²).
      • Informative priors from previous studies: βBMI ∼ N(μ, σ²), βAge ∼ N(μ, σ²), βGender ∼ N(μ, σ²), βRace ∼ N(μ, σ²).

For males, the informative prior we use, from Ali et al. (2024), is

Normal(μ = 1.705, σ² = 0.448²).
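
In brms, an informative prior like this can be attached to an individual coefficient by name. The sketch below is illustrative only: the coefficient label must match the design-matrix names brms generates, and the `GenderFemale` label used here is an assumption (Male is the reference level in this model, so how a "male" prior maps onto the dummy coding is a modeling choice):

```r
library(brms)

# Sketch: attaching an informative normal prior to one coefficient.
# set_prior("normal(mu, sd)") takes a standard deviation, so
# sigma^2 = 0.448^2 corresponds to sd = 0.448.
informative_priors <- c(
  set_prior("normal(1.705, 0.448)", class = "b", coef = "GenderFemale"),  # assumed label
  set_prior("normal(0, 2.5)", class = "b"),         # remaining coefficients
  set_prior("normal(0, 5)",  class = "Intercept")
)
# Pass via `prior = informative_priors` to brm(); the resulting model can
# be compared with the weakly informative fit using loo() or bayes_factor().
```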

Code
# Fitted values and residuals
fitted_imputed1 <- fitted(m_imp)
residual_imputed1 <- residuals(m_imp)

# Residuals vs Fitted plot
plot(fitted_imputed1, residual_imputed1,
     xlab = "Fitted probabilities",
     ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, col = "red", lty = 2)

Code
# Collinearity check
library(car)
vif(m_imp)  # VIF > 5 indicates multicollinearity
           GVIF Df GVIF^(1/(2*Df))
Age    1.108163  1        1.052693
Gender 1.002032  1        1.001015
Race   1.040913  4        1.005025
BMI    1.078206  3        1.012629
Code
# Influential points
library(broom)
influence_m_imp <- broom::augment(m_imp)

# Plot Cook's distance
ggplot(influence_m_imp, aes(x = seq_along(.cooksd), y = .cooksd)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_hline(yintercept = 4 / nrow(influence_m_imp), color = "red", linetype = "dashed") +
  labs(x = "Observation", y = "Cook's Distance", title = "Influential Points (Cook's Distance)") +
  theme_minimal()

Code
influence_m_imp <- influence_m_imp %>%
  mutate(outlier = ifelse(abs(.std.resid) > 2, TRUE, FALSE))

# Cook's distance plot
ggplot(influence_m_imp, aes(x = .hat, y = .cooksd)) +
  geom_point() +
  labs(x = "Leverage", y = "Cook's Distance")

Code
###   Transform Response, check for Goodness-of-Fit   ###

# Numeric response
Imputed_data1$Diabetes_num <- ifelse(Imputed_data1$Diabetes == "Yes", 1, 0)

# Hosmer-Lemeshow test

library(ResourceSelection)
hoslem.test(Imputed_data1$Diabetes_num, fitted(m_imp))

    Hosmer and Lemeshow goodness of fit (GOF) test

data:  Imputed_data1$Diabetes_num, fitted(m_imp)
X-squared = 12190, df = 8, p-value < 2.2e-16
Code
# ANOVA for model
anova(m_imp)
Analysis of Deviance Table

Model: binomial, link: logit

Response: Diabetes

Terms added sequentially (first to last)

       Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                    9812      11267              
Age     1  31.1696      9811      11236 2.364e-08 ***
Gender  1   3.9271      9810      11232   0.04751 *  
Race    4  12.2392      9806      11220   0.01566 *  
BMI     3   0.7227      9803      11219   0.86784    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
anova(m_imp, test = "Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: Diabetes

Terms added sequentially (first to last)

       Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                    9812      11267              
Age     1  31.1696      9811      11236 2.364e-08 ***
Gender  1   3.9271      9810      11232   0.04751 *  
Race    4  12.2392      9806      11220   0.01566 *  
BMI     3   0.7227      9803      11219   0.86784    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
# Adjusted R²
library(glmtoolbox)
adjR2(m_imp)
[1] 0.00335
Code
# Residual vs fitted for imputed data
plot(m_imp$fitted.values, resid(m_imp),
     xlab = "Fitted values",
     ylab = "Residuals",
     main = "Residuals vs Fitted",
     pch = 19, col = "blue")
abline(h = 0, col = "red", lwd = 2)

The Hosmer-Lemeshow test was conducted to assess the goodness of fit of the multivariate logistic regression model, along with adjR2(m_imp), the chi-square test, and a visualization of the model (fitted vs residual values).
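Since the H-L statistic alone does not show where the misfit occurs, a decile calibration check can complement it; a sketch, assuming the `Imputed_data1` data and `m_imp` fit from above:

```r
# Group observations into deciles of predicted probability and compare
# the mean predicted probability with the observed diabetes rate per decile
library(dplyr)

calib <- Imputed_data1 %>%
  mutate(p_hat = fitted(m_imp),
         bin   = ntile(p_hat, 10)) %>%
  group_by(bin) %>%
  summarise(mean_pred = mean(p_hat),
            obs_rate  = mean(Diabetes == "Yes"))

plot(calib$mean_pred, calib$obs_rate,
     xlab = "Mean predicted probability", ylab = "Observed rate",
     main = "Calibration by decile")
abline(0, 1, lty = 2)  # perfect-calibration reference line
```

Points far from the dashed line identify probability ranges where the model is poorly calibrated.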

To overcome the quasi-separation issue in the raw data, Firth (penalized) logistic regression was conducted; the summary, based on only 14 complete observations, is presented below.

Next, to deal with the small sample size, multiple imputation (MICE) was conducted, followed by a regression predicting Diabetes from BMI, Age, Gender, and Race.

A summary and visualization of one dataset extracted from the 5 datasets produced by MICE is presented here, along with the predicted values of the imputed-data model (m_imp) and plots of cross-tabulations between the predictors and the response variable (Diabetes).

Code
# Frequentist logistic regression on raw data - Firth logistic regression (penalized regression)
library(logistf)

m_firth <- logistf(Diabetes ~ BMI + Age + Gender + Race,
                   data = merged_data)
summary(m_firth)
logistf(formula = Diabetes ~ BMI + Age + Gender + Race, data = merged_data)

Model fitted by Penalized ML
Coefficients:
                                              coef  se(coef)  lower 0.95
(Intercept)                              2.2909182 2.0520292  -1.7059362
BMIOverweight                            2.5151817 1.9838969  -1.2146946
BMIObese                                -0.4519637 1.8778881  -5.4843160
Age                                     -0.2545704 0.1911568  -1.4579663
GenderFemale                            -2.1675543 2.0190594 -17.0316807
RaceOther Hispanic                       0.9518795 1.9225099 -12.0194116
RaceNon-Hispanic White                  -2.5671499 2.2189914 -21.6563170
RaceNon-Hispanic Black                   3.3280688 2.2005227  -0.4678607
RaceOther Race - Including Multi-Racial  1.3072954 2.0609641  -2.6483209
                                        upper 0.95      Chisq          p method
(Intercept)                             16.1697706 1.25690744 0.26223728      2
BMIOverweight                           21.9385716 1.68994830 0.19360778      2
BMIObese                                 3.4687761 0.05538832 0.81393924      2
Age                                      0.1148471 1.81192970 0.17827692      2
GenderFemale                             5.9967059 0.69556457 0.40427809      2
RaceOther Hispanic                      10.8617660 0.20246895 0.65273532      2
RaceNon-Hispanic White                   1.6700624 1.36287409 0.24304000      2
RaceNon-Hispanic Black                  25.3575974 2.86971841 0.09026066      2
RaceOther Race - Including Multi-Racial 21.7820826 0.39215076 0.53117103      2

Method: 1-Wald, 2-Profile penalized log-likelihood, 3-None

Likelihood ratio test=6.066658 on 8 df, p=0.6397653, n=14
Wald test = 5.257182 on 8 df, p = 0.7297677
Code
# Imputed data plots ## pred_prob_imputed #
 
Imputed_data1 <- Imputed_data1 %>%
  mutate(pred_prob_imputed = predict(m_imp, type = "response")) # predicted probabilities

Imputed_plot <- Imputed_data1 %>% select(BMI, pred_prob_imputed) %>% mutate(Source = "Imputed")


# Rename probability column to common name
Imputed_plot <- Imputed_plot %>% rename(Pred_Prob = pred_prob_imputed)


ggplot(Imputed_data1, aes(x = BMI, y = pred_prob_imputed, fill = BMI)) +
  geom_boxplot(alpha = 0.6) +
  geom_jitter(width = 0.2, alpha = 0.3, size = 1) +
  labs(x = "BMI Category", y = "Predicted Probability of Diabetes",
       title = "Predicted Diabetes Probability by BMI Category") +
  theme_minimal()

Code
merged_data_clean <- merged_data %>%
  filter(!is.na(BMI), !is.na(Diabetes))

ggplot(merged_data_clean, aes(x = BMI, fill = factor(Diabetes))) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "BMI Category", y = "Proportion with Diabetes",
       fill = "Diabetes",
       title = "Observed Diabetes Proportions by BMI Category") +
  theme_minimal()

Code
ggplot(Imputed_data1, aes(x = Race, y = pred_prob_imputed, fill = Race)) +
  geom_boxplot(alpha = 0.6) +
  geom_jitter(width = 0.2, alpha = 0.3, size = 1) +
  labs(x = "Race ", y = "Predicted Probability of Diabetes",
       title = "Predicted Diabetes Probability by Race ") +
  theme_minimal()

Code
merged_data_clean_Race <- merged_data %>%
  filter(!is.na(Race), !is.na(Diabetes))

ggplot(merged_data_clean, aes(x = Race, fill = factor(Diabetes))) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Race ", y = "Proportion with Diabetes",
       fill = "Diabetes",
       title = "Observed Diabetes Proportions by Race ") +
  theme_minimal()

The proportion of diabetes status by age group (age <40 and >=40) is tabulated below.

Code
ggplot(merged_data, aes(x = Age, y = Diabetes)) + 
  geom_jitter(size = 0.2)

Code
ggplot(Imputed_data1, aes(x = Age, y = Diabetes)) + 
  geom_jitter(size = 0.2)

Code
# Create age groups
# Create contingency table with Diabetes

Imputed_data1$Age_group <- ifelse(Imputed_data1$Age < 40, "<40", ">=40")

tab_age <- table(Imputed_data1$Age_group, Imputed_data1$Diabetes)
prop_age <- prop.table(tab_age, 1) * 100

tab_age
      
        Yes   No
  <40  4414 1691
  >=40 2838  870
Code
prop_age
      
            Yes       No
  <40  72.30139 27.69861
  >=40 76.53722 23.46278
Code
# Convert table to data frame
df_age <- as.data.frame(tab_age)
names(df_age) <- c("Age_group", "Diabetes", "Count")  # rename columns
Code
## Reference: Gelman et al., 2008, "Weakly informative priors": Normal(0, 2.5) for coefficients (b) and Normal(0, 5) for the intercept as default weakly informative priors for logistic regression ##
# Bayesian logistic regression ##
library(brms)

priors <- c(
  set_prior("normal(0, 2.5)", class = "b"),        # for coefficients
  set_prior("normal(0, 5)", class = "Intercept")   # for intercept
)


formula_bayes <- bf(Diabetes ~ Age + BMI + Gender + Race)

Diabetes_prior <- brm(
  formula = formula_bayes,
  data = Imputed_data1,
  family = bernoulli(link = "logit"),   # logistic regression
  prior = priors,
  chains = 4,
  iter = 2000,
  seed = 123,
  control = list(adapt_delta = 0.95)
)

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0.000408 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 4.08 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 14.95 seconds (Warm-up)
Chain 1:                8.295 seconds (Sampling)
Chain 1:                23.245 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 0.000491 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 4.91 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 2: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 2: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 2: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 2: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 2: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 2: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 2: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 2: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 2: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 17.985 seconds (Warm-up)
Chain 2:                6.152 seconds (Sampling)
Chain 2:                24.137 seconds (Total)
Chain 2: 

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 3).
Chain 3: 
Chain 3: Gradient evaluation took 0.000512 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 5.12 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3: 
Chain 3: 
Chain 3: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 3: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 3: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 3: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 3: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 3: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 3: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 3: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 3: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 3: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 3: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 3: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 3: 
Chain 3:  Elapsed Time: 11.074 seconds (Warm-up)
Chain 3:                7.143 seconds (Sampling)
Chain 3:                18.217 seconds (Total)
Chain 3: 

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 4).
Chain 4: 
Chain 4: Gradient evaluation took 0.000538 seconds
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 5.38 seconds.
Chain 4: Adjust your expectations accordingly!
Chain 4: 
Chain 4: 
Chain 4: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 4: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 4: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 4: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 4: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 4: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 4: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 4: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 4: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 4: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 4: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 4: 
Chain 4:  Elapsed Time: 16.036 seconds (Warm-up)
Chain 4:                8.546 seconds (Sampling)
Chain 4:                24.582 seconds (Total)
Chain 4: 
Code
## Bayes model summary
summary(Diabetes_prior)
 Family: bernoulli 
  Links: mu = logit 
Formula: Diabetes ~ Age + BMI + Gender + Race 
   Data: Imputed_data1 (Number of observations: 9813) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                                    Estimate Est.Error l-95% CI u-95% CI Rhat
Intercept                              -0.77      0.16    -1.09    -0.44 1.00
Age                                    -0.00      0.00    -0.01    -0.00 1.00
BMINormalweight                         0.03      0.16    -0.27     0.34 1.00
BMIOverweight                           0.08      0.16    -0.24     0.40 1.00
BMIObese                                0.03      0.16    -0.29     0.35 1.00
GenderFemale                           -0.09      0.05    -0.18    -0.00 1.00
RaceOtherHispanic                      -0.15      0.09    -0.34     0.03 1.00
RaceNonMHispanicWhite                  -0.21      0.07    -0.34    -0.08 1.00
RaceNonMHispanicBlack                  -0.06      0.07    -0.20     0.09 1.00
RaceOtherRaceMIncludingMultiMRacial    -0.11      0.08    -0.26     0.05 1.00
                                    Bulk_ESS Tail_ESS
Intercept                               1834     1902
Age                                     4495     2737
BMINormalweight                         1820     1960
BMIOverweight                           1899     2263
BMIObese                                1884     1952
GenderFemale                            3564     2756
RaceOtherHispanic                       2685     2528
RaceNonMHispanicWhite                   2156     2917
RaceNonMHispanicBlack                   2444     2590
RaceOtherRaceMIncludingMultiMRacial     2311     2896

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
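The coefficient estimates above are on the log-odds scale; exponentiating converts them to odds ratios. A short sketch, assuming the `Diabetes_prior` fit from above:

```r
# Convert posterior log-odds estimates and 95% CIs to odds ratios
fe <- brms::fixef(Diabetes_prior)   # columns: Estimate, Est.Error, Q2.5, Q97.5
odds_ratios <- exp(fe[, c("Estimate", "Q2.5", "Q97.5")])
round(odds_ratios, 3)
# e.g. a log-odds estimate of -0.21 corresponds to an odds ratio of exp(-0.21) ≈ 0.81
```

Odds ratios below 1 (with a credible interval excluding 1) indicate lower odds of diabetes relative to the reference category.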
Code
plot(Diabetes_prior) 

Code
pp_check(Diabetes_prior)

Code
## Draws 

# Generate fitted draws directly with brms
fitted_draws <- fitted(
  Diabetes_prior,
  newdata = Imputed_data1,
  summary = FALSE,   # return individual posterior draws instead of a summary
  ndraws = 100       # limit to 100 draws (`nsamples` in older brms versions)
)

# Convert to long format manually


fitted_long <- data.frame(
  BMI = rep(Imputed_data1$BMI, times = 100),
  draw = rep(1:100, each = nrow(Imputed_data1)),
  .value = as.vector(t(fitted_draws))
)

### BMI Plot the fitted lines
ggplot(fitted_long, aes(x = BMI, y = .value, group = draw)) +
  geom_line(size = 0.1, alpha = 0.4) +
  labs(y = "Predicted probability of Diabetes")

Code
### Age
fitted_long <- data.frame(
  Age = rep(Imputed_data1$Age, times = 100),
  draw = rep(1:100, each = nrow(Imputed_data1)),
  .value = as.vector(t(fitted_draws))
)
ggplot(fitted_long, aes(x = Age, y = .value, group = draw)) +
  geom_line(size = 0.1, alpha = 0.4) +
  labs(y = "Predicted probability of Diabetes")

Code
### Race

fitted_long <- data.frame(
  Race = rep(Imputed_data1$Race, times = 100),
  draw = rep(1:100, each = nrow(Imputed_data1)),
  .value = as.vector(t(fitted_draws))
)
ggplot(fitted_long, aes(x = Race, y = .value, group = draw)) +
  geom_line(size = 0.1, alpha = 0.4) +
  labs(y = "Predicted probability of Diabetes")

Code
### Gender
fitted_long <- data.frame(
  Gender = rep(Imputed_data1$Gender, times = 100),
  draw = rep(1:100, each = nrow(Imputed_data1)),
  .value = as.vector(t(fitted_draws))
)
ggplot(fitted_long, aes(x = Gender, y = .value, group = draw)) +
  geom_line(size = 0.1, alpha = 0.4) +
  labs(y = "Predicted probability of Diabetes")

Code
### Age cut by 10 years and Diabetes plot and histogram (imputed data)

 Imputed_data1 %>% 
  mutate(age_bracket = 
           cut(Age, breaks = seq(10, 100, by = 10))) %>% 
  group_by(age_bracket) %>% 
  summarise(Diabetes = mean(Diabetes == "Yes")) %>% 
  ggplot(aes(x = age_bracket, y = Diabetes)) + 
    geom_point() + 
    theme(axis.text.x = element_text(angle = 45, vjust = 0.5))

Code
### predict values of 100 draws, Simulated predictions (binary diabetes outcome)

pred <- posterior_predict(Diabetes_prior, newdata = Imputed_data1, ndraws = 100)  # `ndraws` (not `draws`) is the brms argument

# data frame for summarizing
pred_df <- as.data.frame(t(pred)) 

# proportion of diabetes = 1 per draw
prop_diabetes <- colMeans(pred_df == 1)


prop_df <- tibble(
  draw = 1:length(prop_diabetes),
  proportion_Diabetes = prop_diabetes    ## proportion of Diabetes with age cut category
)

library(ggplot2)
ggplot(prop_df, aes(x = proportion_Diabetes)) +
  geom_histogram(color = "white")

Bayesian Logistic Regression Model

  • A weakly informative prior was used.
  • Compilation, iterations, and posterior draws were run using NUTS sampling.
  • Fitted draws from the model posterior (n = 100) were analyzed.
  • Estimates and Rhat were examined for convergence; plots are presented below.
  • A histogram of predicted values (n = 100 draws) shows the observed proportion of diabetes status.
  • A scatterplot shows the proportion of diabetes grouped by age.

We then refit the model including the data likelihood to obtain the posterior model, analyzed 100 posterior draws, and plotted the posterior predicted probabilities of diabetes against Age, BMI, Race, and Gender.

Code
library(brms)
library(GGally)

# Refit the prior model including the data likelihood (posterior model)

Diabetes_model_1 <- update(
  Diabetes_prior,
  sample_prior = "yes"   # retains prior draws alongside the posterior
)

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0.000406 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 4.06 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 14.233 seconds (Warm-up)
Chain 1:                7.166 seconds (Sampling)
Chain 1:                21.399 seconds (Total)
Chain 1: 

(Analogous sampling logs for chains 2–4 omitted; all four chains completed 2000 iterations.)
Code
# BMI
# Posterior fitted values (probabilities of Diabetes)
fitted_draws <- fitted(
  Diabetes_model_1,         # <-- use posterior model here
  newdata = Imputed_data1,
  summary = FALSE,          # return full posterior draws, not summaries
  nsamples = 100            # 100 posterior draws (`ndraws` in newer brms)
)

# Reshape into long format
fitted_long <- data.frame(
  BMI = rep(Imputed_data1$BMI, times = 100),
  draw = rep(1:100, each = nrow(Imputed_data1)),
  .value = as.vector(t(fitted_draws))
)

# Plot posterior fitted lines (BMI)
library(ggplot2)
ggplot(fitted_long, aes(x = BMI, y = .value, group = draw)) +
  geom_line(size = 0.1, alpha = 0.4) +
  labs(
    y = "Predicted probability of Diabetes",
    title = "Posterior fitted draws by BMI"
  )

Code
# Age
# Reshape into long format

fitted_long <- data.frame(
  Age = rep(Imputed_data1$Age, times = 100),
  draw = rep(1:100, each = nrow(Imputed_data1)),
  .value = as.vector(t(fitted_draws))
)

# Plot posterior fitted lines
library(ggplot2)
ggplot(fitted_long, aes(x = Age, y = .value, group = draw)) +
  geom_line(size = 0.1, alpha = 0.4) +
  labs(
    y = "Predicted probability of Diabetes",
    title = "Posterior fitted draws by Age"
  )

Code
# Race
# Reshape into long format

fitted_long <- data.frame(
  Race = rep(Imputed_data1$Race, times = 100),
  draw = rep(1:100, each = nrow(Imputed_data1)),
  .value = as.vector(t(fitted_draws))
)

# Plot posterior fitted lines
library(ggplot2)
ggplot(fitted_long, aes(x = Race, y = .value, group = draw)) +
  geom_line(size = 0.1, alpha = 0.4) +
  labs(
    y = "Predicted probability of Diabetes",
    title = "Posterior fitted draws by Race"
  )

Code
fitted_long <- data.frame(
  Gender = rep(Imputed_data1$Gender, times = 100),
  draw = rep(1:100, each = nrow(Imputed_data1)),
  .value = as.vector(t(fitted_draws))
)

# Plot posterior fitted lines
library(ggplot2)
ggplot(fitted_long, aes(x = Gender, y = .value, group = draw)) +
  geom_line(size = 0.1, alpha = 0.4) +
  labs(
    y = "Predicted probability of Diabetes",
    title = "Posterior fitted draws by Gender"
  )

Code
# MCMC trace, density, & autocorrelation plots

plot(Diabetes_model_1, combo = c("dens", "trace"))

  • We performed Bayesian logistic regression to model the probability of diabetes as a function of age, BMI, gender, and race.

  • We used weakly informative normal priors (Gosho et al., 2025) on all regression coefficients. Models were fit using four MCMC chains of 2,000 iterations each, including 1,000 warm-up iterations. Convergence was assessed using Rhat values, effective sample sizes, and trace plots. Posterior predictive checks were used to evaluate model fit, and coefficients were exponentiated to report odds ratios with 95% credible intervals.

  • The cluster of points representing higher predicted probabilities of diabetes is denser in the middle to older age ranges (roughly 40 to 80 years) than in the younger ranges, although diabetes is present even at younger ages.
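As a worked example of the odds-ratio conversion mentioned above, exponentiating a log-odds coefficient gives the multiplicative change in the odds of diabetes; the coefficients below are taken from the fitted model output reported later in this document.

```r
# Log-odds coefficients from the fitted logistic model (values from the
# model summary later in this report)
beta <- c(Age = -0.004728, GenderFemale = -0.092730,
          RaceNonHispanicWhite = -0.212663)

# Odds ratios: OR = exp(beta)
OR <- round(exp(beta), 3)
OR   # Age = 0.995, GenderFemale = 0.911, RaceNonHispanicWhite = 0.808
```

With a brms fit, the same conversion applies to the posterior summaries, e.g. exponentiating the `fixef()` estimates and credible-interval bounds.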

Comparing Models

  • Linear regression model on raw data

  • Multivariate logistic regression on imputed dataset (MI + MLR)

  • Bayesian logistic regression on the imputed dataset. The spread of the fitted lines indicates the variability (uncertainty) in the predicted probabilities within each BMI group under the posterior model. The plots visually demonstrate the well-established relationship between BMI and the predicted probability of diabetes, with risk increasing as BMI moves from normal weight to the overweight and obese categories.

All Results Summarized

Sample Characteristics (merged data)

  • The sample comprised 9,813 participants with an age range of 0 to 80 years (mean = 31.6 years, median = 27 years, IQR = 10–52 years), including 4,831 males (49.2%) and 4,982 females (50.8%). The largest race/ethnicity groups were Non-Hispanic White (n = 3,538, 36.0%) and Non-Hispanic Black (n = 2,198, 22.4%), followed by Mexican American (n = 1,685, 17.2%), Other Hispanic (n = 930, 9.5%), and Other Race/Multiracial (n = 1,462, 14.9%).

  • By body mass index (BMI) category, 132 participants (1.3%) were classified as underweight, 2,167 (22.1%) as normal weight, 595 (6.1%) as overweight, and 629 (6.4%) as obese; BMI data were missing for 6,290 participants (64.1%).

  • A total of 553 participants (5.6%) reported having diabetes, while 169 (1.7%) reported no diabetes; diabetes status was missing for a substantial proportion of the sample (9,091 participants, 92.6%).

  • Survey design variables included PSU (range = 1–2), strata (range = 104–118), and sampling weights (mean = 31,713; range = 3,748–171,395).

Sample Characteristics (After Multiple Imputation)

  • The imputed sample included 9,813 participants with an age range of 0 to 80 years (mean = 31.6 years, median = 27 years, IQR = 10–52 years). The gender distribution remained nearly balanced, with 4,831 males (49.2%) and 4,982 females (50.8%).

  • Race/ethnicity composition was Non-Hispanic White (n = 3,538, 36.0%), Non-Hispanic Black (n = 2,198, 22.4%), Mexican American (n = 1,685, 17.2%), Other Hispanic (n = 930, 9.5%), and Other Race/Multiracial (n = 1,462, 14.9%).

  • After imputation, 218 participants (2.2%) were classified as underweight, 5,107 (52.0%) as normal weight, 1,831 (18.7%) as overweight, and 2,657 (27.1%) as obese by body mass index (BMI) category.

  • For diabetes status, 7,252 participants (73.9%) were classified as having diabetes and 2,561 (26.1%) as not having diabetes. The imputation model included predicted log-odds of diabetes (mean logit = −1.05, range = −1.43 to −0.69) and corresponding predicted probabilities (mean = 0.26, range = 0.19–0.33), which were used to generate imputed diabetes outcomes.

  • Survey design variables were preserved, including PSU (range = 1–2), strata (range = 104–118), and sampling weights (mean = 31,713; range = 3,748–171,395).
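The outcome-imputation mechanism described above (converting predicted log-odds to probabilities, then drawing binary outcomes) can be sketched in base R. The logits below are simulated around the reported mean of −1.05; they are illustrative, not the study's actual per-participant predictions.

```r
set.seed(1)

# Simulated predicted log-odds, centered on the reported mean (-1.05);
# these stand in for the imputation model's actual predictions
logit <- rnorm(9813, mean = -1.05, sd = 0.15)

# Inverse-logit transform: predicted probability of diabetes
prob <- plogis(logit)

# Draw an imputed binary outcome for each participant
Diabetes_imputed <- rbinom(length(prob), size = 1, prob = prob)

mean(prob)   # close to the reported mean predicted probability (~0.26)
```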

Multivariable logistic regression model to examine the association between demographic and anthropometric characteristics and diabetes status in the imputed dataset (N = 9,813).

  • Age was significantly associated with diabetes, with each additional year corresponding to a small reduction in odds (OR = 0.995, 95% CI: 0.993–0.997, p < 0.001).

  • Females had lower odds of diabetes compared with males (OR = 0.91, 95% CI: 0.84–1.00, p = 0.044).

  • Compared with Mexican Americans (reference), Non-Hispanic Whites had significantly lower odds of diabetes (OR = 0.81, 95% CI: 0.71–0.92, p = 0.002). Other Hispanic, Non-Hispanic Black, and Other Race/Multiracial groups were not significantly different from the reference group (all p > 0.09).

  • BMI categories were not significantly associated with diabetes (all p > 0.65).

  • Model diagnostics:

    • ANOVA (Type II sequential tests): Age (χ²(1) = 31.17, p < 0.001), Gender (χ²(1) = 3.93, p = 0.048), and Race (χ²(4) = 12.24, p = 0.016) significantly improved model fit, whereas BMI (χ²(3) = 0.72, p = 0.868) did not.

    • Goodness-of-fit: The Hosmer–Lemeshow test indicated poor model fit (χ²(8) = 12,190, p < 0.001).

    • Predictor independence: Pearson’s chi-squared test across BMI categories was nonsignificant (p = 0.546), indicating no strong association between BMI and diabetes.

    • Multicollinearity: Variance inflation factors (VIFs) were all close to 1, indicating no collinearity issues.

    • Explained variance: The adjusted R² for the model was very low (Adj. R² = 0.003), indicating that the predictors explained <1% of the variance in diabetes status.

Summary: Age, gender, and race/ethnicity (particularly Non-Hispanic White vs. Mexican American) were significant predictors of diabetes, while BMI categories were not.

Regression Analyses

Multiple modeling strategies, including frequentist, penalized likelihood, and Bayesian approaches, were conducted to assess predictors of diabetes.

  1. Firth Penalized Logistic Regression
  • Firth's penalized likelihood was applied to reduce small-sample bias in the sparse complete-case dataset (n = 14).

  • None of the predictors (BMI, age, gender, race) reached statistical significance (all p > 0.09).

  • Likelihood ratio and Wald tests were nonsignificant (LRT = 6.07, p = 0.64; Wald = 5.26, p = 0.73), suggesting the model provided limited explanatory power in the small sample.
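The small-sample behavior motivating Firth's method can be illustrated on a toy dataset with complete separation, where ordinary maximum likelihood diverges but the penalized fit remains finite. This is a generic sketch using the logistf package (already used in this report), not the study's data.

```r
library(logistf)

# Toy data with complete separation: x <= 3 always gives y = 0 and
# x >= 4 always gives y = 1, so the MLE slope from glm() diverges
d <- data.frame(x = 1:6, y = c(0, 0, 0, 1, 1, 1))

# Firth's penalized likelihood keeps the estimates finite
fit <- logistf(y ~ x, data = d)
coef(fit)
```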

  2. Frequentist Logistic Regression (Imputed Dataset)
  • In the imputed dataset (N = 9,813), age (OR ≈ 0.995, p < 0.001), gender (female vs. male: OR ≈ 0.91, p = 0.044), and race/ethnicity (Non-Hispanic White vs. Mexican American: OR ≈ 0.81, p = 0.002) were significantly associated with diabetes status.

  • BMI categories were not significant predictors. The adjusted R² was low (Adj. R² = 0.003), indicating limited variance explained.

  3. Bayesian Logistic Regression
  • Bayesian estimation with weakly informative priors yielded results consistent with the frequentist model.

  • Posterior means and 95% credible intervals indicated that increasing age (Estimate = –0.00, 95% CrI: –0.01 to –0.00), female gender (Estimate = –0.09, 95% CrI: –0.18 to 0.00), and Non-Hispanic White ethnicity (Estimate = –0.21, 95% CrI: –0.34 to –0.08) were associated with lower odds of diabetes.

  • BMI categories again showed no significant associations. Convergence diagnostics were satisfactory (Rhat = 1.00 for all parameters; Bulk ESS > 1,800).

  • Posterior predictive checks showed that the Bayesian logistic regression model is well calibrated to the observed data distribution: the likelihood assumption (Bernoulli with logit link) appears appropriate, and no discrepancies (e.g., posterior draws shifted away from the observed density) were observed, suggesting adequate model fit.

Conclusion

  • Using multiple imputation (MI) allowed inclusion of all 9,813 participants, increasing power and reducing bias compared to a complete-case analysis. In epidemiologic studies, MI is critical when missingness is non-negligible, especially for outcomes like diabetes.

  • Across modeling frameworks, age, gender, and race/ethnicity consistently emerged as predictors of diabetes status, while BMI was not associated. The penalized regression model (n = 14) had insufficient power, whereas both frequentist and Bayesian analyses on the imputed dataset (N ≈ 9,813) showed concordant results, validating the robustness of findings.

  • Posterior distributions and credible intervals from the Bayesian method offered a probabilistic interpretation of effect sizes and more intuitive uncertainty quantification, while allowing prior knowledge to be incorporated.

  • The Firth penalized logistic regression on sparse data underscores the importance of adequate sample size for reliable inference: penalization reduces small-sample bias, but it cannot compensate for extreme sparsity.

  • Low adjusted R² (~0.003) and Hosmer-Lemeshow test results suggest incorporating additional behavioral, clinical, and biological variables to improve explanatory performance.

  • The uniformly low variance inflation factors emphasize the value of routinely checking multicollinearity to ensure coefficient estimates are reliable and interpretable.

  • Sequential deviance testing is a useful diagnostic to identify the contribution of individual predictors in logistic regression, complementing coefficient-based inference.

  • Handling missing data (via MI) and comparing frequentist, Bayesian, and penalized approaches strengthens the credibility of results.

Discussion

The use of multiple imputation allowed for robust analysis despite missing data, increasing power and reducing bias. Comparison of frequentist and Bayesian models demonstrated consistency in significant predictors, while Bayesian approaches provided the advantage of posterior distributions and probabilistic interpretation. The Firth penalized regression highlighted the limitations of small-sample analyses, showing wide confidence intervals and nonsignificant results when data were sparse.

References

Codes for summaries and tables

Code
summary(merged_data)
       ID                   BMI            Age           Gender    
 Min.   :73557   Underweight  : 132   Min.   : 0.00   Male  :4831  
 1st Qu.:76092   Normal weight:2167   1st Qu.:10.00   Female:4982  
 Median :78643   Overweight   : 595   Median :27.00                
 Mean   :78645   Obese        : 629   Mean   :31.63                
 3rd Qu.:81191   NA's         :6290   3rd Qu.:52.00                
 Max.   :83731                        Max.   :80.00                
                                  Race           PSU            Strata     
 Mexican American                   :1685   Min.   :1.000   Min.   :104.0  
 Other Hispanic                     : 930   1st Qu.:1.000   1st Qu.:107.0  
 Non-Hispanic White                 :3538   Median :1.000   Median :111.0  
 Non-Hispanic Black                 :2198   Mean   :1.483   Mean   :110.9  
 Other Race - Including Multi-Racial:1462   3rd Qu.:2.000   3rd Qu.:115.0  
                                            Max.   :2.000   Max.   :118.0  
     Weight       Diabetes   
 Min.   :  3748   Yes : 553  
 1st Qu.: 13187   No  : 169  
 Median : 20965   NA's:9091  
 Mean   : 31713              
 3rd Qu.: 37922              
 Max.   :171395              
Code
summary(Imputed_data1)
       ID                   BMI            Age           Gender    
 Min.   :73557   Underweight  : 218   Min.   : 0.00   Male  :4831  
 1st Qu.:76092   Normal weight:5107   1st Qu.:10.00   Female:4982  
 Median :78643   Overweight   :1831   Median :27.00                
 Mean   :78645   Obese        :2657   Mean   :31.63                
 3rd Qu.:81191                        3rd Qu.:52.00                
 Max.   :83731                        Max.   :80.00                
                                  Race           PSU            Strata     
 Mexican American                   :1685   Min.   :1.000   Min.   :104.0  
 Other Hispanic                     : 930   1st Qu.:1.000   1st Qu.:107.0  
 Non-Hispanic White                 :3538   Median :1.000   Median :111.0  
 Non-Hispanic Black                 :2198   Mean   :1.483   Mean   :110.9  
 Other Race - Including Multi-Racial:1462   3rd Qu.:2.000   3rd Qu.:115.0  
                                            Max.   :2.000   Max.   :118.0  
     Weight       Diabetes       logit              prob         Diabetes_num  
 Min.   :  3748   Yes:7252   Min.   :-1.4253   Min.   :0.1938   Min.   :0.000  
 1st Qu.: 13187   No :2561   1st Qu.:-1.1612   1st Qu.:0.2384   1st Qu.:0.000  
 Median : 20965              Median :-1.0383   Median :0.2615   Median :1.000  
 Mean   : 31713              Mean   :-1.0471   Mean   :0.2610   Mean   :0.739  
 3rd Qu.: 37922              3rd Qu.:-0.9230   3rd Qu.:0.2843   3rd Qu.:1.000  
 Max.   :171395              Max.   :-0.6904   Max.   :0.3339   Max.   :1.000  
 pred_prob_imputed  Age_group        
 Min.   :0.1938    Length:9813       
 1st Qu.:0.2384    Class :character  
 Median :0.2615    Mode  :character  
 Mean   :0.2610                      
 3rd Qu.:0.2843                      
 Max.   :0.3339                      
Code
summary(m_imp)

Call:
glm(formula = Diabetes ~ Age + Gender + Race + BMI, family = binomial, 
    data = Imputed_data1)

Coefficients:
                                         Estimate Std. Error z value Pr(>|z|)
(Intercept)                             -0.761926   0.162425  -4.691 2.72e-06
Age                                     -0.004728   0.001011  -4.678 2.89e-06
GenderFemale                            -0.092730   0.046133  -2.010  0.04442
RaceOther Hispanic                      -0.154250   0.092549  -1.667  0.09558
RaceNon-Hispanic White                  -0.212663   0.067746  -3.139  0.00169
RaceNon-Hispanic Black                  -0.055537   0.072128  -0.770  0.44131
RaceOther Race - Including Multi-Racial -0.109388   0.080345  -1.361  0.17336
BMINormal weight                         0.024339   0.156361   0.156  0.87630
BMIOverweight                            0.071544   0.162658   0.440  0.66005
BMIObese                                 0.020287   0.161137   0.126  0.89981
                                           
(Intercept)                             ***
Age                                     ***
GenderFemale                            *  
RaceOther Hispanic                      .  
RaceNon-Hispanic White                  ** 
RaceNon-Hispanic Black                     
RaceOther Race - Including Multi-Racial    
BMINormal weight                           
BMIOverweight                              
BMIObese                                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 11267  on 9812  degrees of freedom
Residual deviance: 11219  on 9803  degrees of freedom
AIC: 11239

Number of Fisher Scoring iterations: 4
Code
coef(m_imp)
                            (Intercept)                                     Age 
                           -0.761925862                            -0.004727998 
                           GenderFemale                      RaceOther Hispanic 
                           -0.092730385                            -0.154250348 
                 RaceNon-Hispanic White                  RaceNon-Hispanic Black 
                           -0.212662854                            -0.055536906 
RaceOther Race - Including Multi-Racial                        BMINormal weight 
                           -0.109388375                             0.024339471 
                          BMIOverweight                                BMIObese 
                            0.071543638                             0.020286793 
Code
confint(m_imp)
                                               2.5 %       97.5 %
(Intercept)                             -1.086867141 -0.449109500
Age                                     -0.006713516 -0.002751496
GenderFemale                            -0.183172208 -0.002320639
RaceOther Hispanic                      -0.336602276  0.026303770
RaceNon-Hispanic White                  -0.345137536 -0.079535197
RaceNon-Hispanic Black                  -0.196779898  0.086003952
RaceOther Race - Including Multi-Racial -0.267119094  0.047893182
BMINormal weight                        -0.276189156  0.337871931
BMIOverweight                           -0.241855094  0.396816724
BMIObese                                -0.289985572  0.342735874
Code
chisq.test(tab)

    Pearson's Chi-squared test

data:  tab
X-squared = 2.1289, df = 3, p-value = 0.5461
Code
vif(m_imp)
           GVIF Df GVIF^(1/(2*Df))
Age    1.108163  1        1.052693
Gender 1.002032  1        1.001015
Race   1.040913  4        1.005025
BMI    1.078206  3        1.012629
Code
hoslem.test(Imputed_data1$Diabetes_num, fitted(m_imp))

    Hosmer and Lemeshow goodness of fit (GOF) test

data:  Imputed_data1$Diabetes_num, fitted(m_imp)
X-squared = 12190, df = 8, p-value < 2.2e-16
Code
influence_m_imp <- broom::augment(m_imp)
marginplot(
  Imputed_data1[, c("BMI", "Diabetes")],
  col = mdc(1:2, trans = FALSE),
  cex = 1.2,
  cex.lab = 1.2,
  cex.numbers = 1.3,
  pch = 19
)

Code
anova(m_imp)
Analysis of Deviance Table

Model: binomial, link: logit

Response: Diabetes

Terms added sequentially (first to last)

       Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                    9812      11267              
Age     1  31.1696      9811      11236 2.364e-08 ***
Gender  1   3.9271      9810      11232   0.04751 *  
Race    4  12.2392      9806      11220   0.01566 *  
BMI     3   0.7227      9803      11219   0.86784    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
anova(m_imp, test = "Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: Diabetes

Terms added sequentially (first to last)

       Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                    9812      11267              
Age     1  31.1696      9811      11236 2.364e-08 ***
Gender  1   3.9271      9810      11232   0.04751 *  
Race    4  12.2392      9806      11220   0.01566 *  
BMI     3   0.7227      9803      11219   0.86784    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
# Adjusted R²
library(glmtoolbox)
adjR2(m_imp)
[1] 0.00335
Code
summary(m_firth)
logistf(formula = Diabetes ~ BMI + Age + Gender + Race, data = merged_data)

Model fitted by Penalized ML
Coefficients:
                                              coef  se(coef)  lower 0.95
(Intercept)                              2.2909182 2.0520292  -1.7059362
BMIOverweight                            2.5151817 1.9838969  -1.2146946
BMIObese                                -0.4519637 1.8778881  -5.4843160
Age                                     -0.2545704 0.1911568  -1.4579663
GenderFemale                            -2.1675543 2.0190594 -17.0316807
RaceOther Hispanic                       0.9518795 1.9225099 -12.0194116
RaceNon-Hispanic White                  -2.5671499 2.2189914 -21.6563170
RaceNon-Hispanic Black                   3.3280688 2.2005227  -0.4678607
RaceOther Race - Including Multi-Racial  1.3072954 2.0609641  -2.6483209
                                        upper 0.95      Chisq          p method
(Intercept)                             16.1697706 1.25690744 0.26223728      2
BMIOverweight                           21.9385716 1.68994830 0.19360778      2
BMIObese                                 3.4687761 0.05538832 0.81393924      2
Age                                      0.1148471 1.81192970 0.17827692      2
GenderFemale                             5.9967059 0.69556457 0.40427809      2
RaceOther Hispanic                      10.8617660 0.20246895 0.65273532      2
RaceNon-Hispanic White                   1.6700624 1.36287409 0.24304000      2
RaceNon-Hispanic Black                  25.3575974 2.86971841 0.09026066      2
RaceOther Race - Including Multi-Racial 21.7820826 0.39215076 0.53117103      2

Method: 1-Wald, 2-Profile penalized log-likelihood, 3-None

Likelihood ratio test=6.066658 on 8 df, p=0.6397653, n=14
Wald test = 5.257182 on 8 df, p = 0.7297677
Code
summary(Diabetes_prior)
 Family: bernoulli 
  Links: mu = logit 
Formula: Diabetes ~ Age + BMI + Gender + Race 
   Data: Imputed_data1 (Number of observations: 9813) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                                    Estimate Est.Error l-95% CI u-95% CI Rhat
Intercept                              -0.77      0.16    -1.09    -0.44 1.00
Age                                    -0.00      0.00    -0.01    -0.00 1.00
BMINormalweight                         0.03      0.16    -0.27     0.34 1.00
BMIOverweight                           0.08      0.16    -0.24     0.40 1.00
BMIObese                                0.03      0.16    -0.29     0.35 1.00
GenderFemale                           -0.09      0.05    -0.18    -0.00 1.00
RaceOtherHispanic                      -0.15      0.09    -0.34     0.03 1.00
RaceNonMHispanicWhite                  -0.21      0.07    -0.34    -0.08 1.00
RaceNonMHispanicBlack                  -0.06      0.07    -0.20     0.09 1.00
RaceOtherRaceMIncludingMultiMRacial    -0.11      0.08    -0.26     0.05 1.00
                                    Bulk_ESS Tail_ESS
Intercept                               1834     1902
Age                                     4495     2737
BMINormalweight                         1820     1960
BMIOverweight                           1899     2263
BMIObese                                1884     1952
GenderFemale                            3564     2756
RaceOtherHispanic                       2685     2528
RaceNonMHispanicWhite                   2156     2917
RaceNonMHispanicBlack                   2444     2590
RaceOtherRaceMIncludingMultiMRacial     2311     2896

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Code
summary(Diabetes_model_1)
 Family: bernoulli 
  Links: mu = logit 
Formula: Diabetes ~ Age + BMI + Gender + Race 
   Data: Imputed_data1 (Number of observations: 9813) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                                    Estimate Est.Error l-95% CI u-95% CI Rhat
Intercept                              -0.77      0.16    -1.09    -0.46 1.00
Age                                    -0.00      0.00    -0.01    -0.00 1.00
BMINormalweight                         0.03      0.15    -0.27     0.34 1.00
BMIOverweight                           0.08      0.16    -0.24     0.39 1.00
BMIObese                                0.03      0.16    -0.28     0.34 1.00
GenderFemale                           -0.09      0.05    -0.19    -0.00 1.00
RaceOtherHispanic                      -0.15      0.09    -0.33     0.02 1.00
RaceNonMHispanicWhite                  -0.21      0.07    -0.34    -0.08 1.00
RaceNonMHispanicBlack                  -0.06      0.07    -0.19     0.09 1.00
RaceOtherRaceMIncludingMultiMRacial    -0.11      0.08    -0.27     0.04 1.00
                                    Bulk_ESS Tail_ESS
Intercept                               1641     2148
Age                                     3991     3187
BMINormalweight                         1535     2179
BMIOverweight                           1535     2144
BMIObese                                1519     2160
GenderFemale                            3181     2419
RaceOtherHispanic                       2731     2921
RaceNonMHispanicWhite                   2382     2711
RaceNonMHispanicBlack                   2425     2486
RaceOtherRaceMIncludingMultiMRacial     2659     2598

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Code
### separate table format 
library(dplyr)
library(knitr)
library(kableExtra)
library(broom)

# The three model objects summarized below (fit earlier in the document):
# m_imp <- glm(Diabetes ~ Age + BMI + Gender + Race, data = df1, family = binomial)
# m_firth <- logistf(Diabetes ~ BMI + Age + Gender + Race, data = merged_data)
# Diabetes_model_1 <- update(Diabetes_prior, sample_prior = "yes")  # includes priors + data likelihood


# Tidy the imputed logistic regression model
sum_imp <- broom::tidy(m_imp) %>%
  mutate(
    OR = exp(estimate),                # Odds ratio
    `Lower 95% CI` = exp(estimate - 1.96*std.error),  # Lower CI
    `Upper 95% CI` = exp(estimate + 1.96*std.error)   # Upper CI
  ) %>%
  select(term, estimate, std.error, OR, `Lower 95% CI`, `Upper 95% CI`, p.value)

# Display as a table
kbl(sum_imp, digits = 3, booktabs = TRUE, caption = "Imputed Logistic Regression Results") %>%
  kable_classic(full_width = F) 
Imputed Logistic Regression Results

| term | estimate | std.error | OR | Lower 95% CI | Upper 95% CI | p.value |
|------|---------:|----------:|---:|-------------:|-------------:|--------:|
| (Intercept) | -0.762 | 0.162 | 0.467 | 0.340 | 0.642 | 0.000 |
| Age | -0.005 | 0.001 | 0.995 | 0.993 | 0.997 | 0.000 |
| GenderFemale | -0.093 | 0.046 | 0.911 | 0.833 | 0.998 | 0.044 |
| RaceOther Hispanic | -0.154 | 0.093 | 0.857 | 0.715 | 1.028 | 0.096 |
| RaceNon-Hispanic White | -0.213 | 0.068 | 0.808 | 0.708 | 0.923 | 0.002 |
| RaceNon-Hispanic Black | -0.056 | 0.072 | 0.946 | 0.821 | 1.090 | 0.441 |
| RaceOther Race - Including Multi-Racial | -0.109 | 0.080 | 0.896 | 0.766 | 1.049 | 0.173 |
| BMINormal weight | 0.024 | 0.156 | 1.025 | 0.754 | 1.392 | 0.876 |
| BMIOverweight | 0.072 | 0.163 | 1.074 | 0.781 | 1.478 | 0.660 |
| BMIObese | 0.020 | 0.161 | 1.020 | 0.744 | 1.399 | 0.900 |
Code
library(logistf)

# Extract results from the Firth (penalized-likelihood) regression
sum_firth <- data.frame(
  term      = names(coef(m_firth)),       # predictor names
  estimate  = coef(m_firth),              # penalized-likelihood coefficients
  std.error = sqrt(diag(vcov(m_firth))),  # standard errors
  p.value   = m_firth$prob,               # penalized likelihood-ratio p-values
  row.names = NULL                        # drop row names so terms are not duplicated
) %>%
  mutate(
    OR = exp(estimate),                              # odds ratio
    # Wald-type CIs; logistf's profile-likelihood CIs (m_firth$ci.lower,
    # m_firth$ci.upper) are preferred with sparse data
    `Lower 95% CI` = exp(estimate - 1.96*std.error),
    `Upper 95% CI` = exp(estimate + 1.96*std.error)
  )

# Display the table
kbl(sum_firth, digits = 3, booktabs = TRUE, caption = "Firth Logistic Regression Results") %>%
  kable_classic(full_width = F)
Firth Logistic Regression Results

| term | estimate | std.error | p.value | OR | Lower 95% CI | Upper 95% CI |
|------|---------:|----------:|--------:|---:|-------------:|-------------:|
| (Intercept) | 2.291 | 2.052 | 0.262 | 9.884 | 0.177 | 551.640 |
| BMIOverweight | 2.515 | 1.984 | 0.194 | 12.369 | 0.253 | 604.027 |
| BMIObese | -0.452 | 1.878 | 0.814 | 0.636 | 0.016 | 25.247 |
| Age | -0.255 | 0.191 | 0.178 | 0.775 | 0.533 | 1.128 |
| GenderFemale | -2.168 | 2.019 | 0.404 | 0.114 | 0.002 | 5.988 |
| RaceOther Hispanic | 0.952 | 1.923 | 0.653 | 2.591 | 0.060 | 112.168 |
| RaceNon-Hispanic White | -2.567 | 2.219 | 0.243 | 0.077 | 0.001 | 5.942 |
| RaceNon-Hispanic Black | 3.328 | 2.201 | 0.090 | 27.884 | 0.373 | 2082.019 |
| RaceOther Race - Including Multi-Racial | 1.307 | 2.061 | 0.531 | 3.696 | 0.065 | 209.932 |
Code
library(broom.mixed)   # tidy for brms
library(brms)

# Extract posterior summaries for the fixed effects
sum_bayes <- broom.mixed::tidy(Diabetes_model_1, effects = "fixed") %>%
  mutate(
    OR = exp(estimate),                  # posterior-mean odds ratio
    # Normal approximation on the log-odds scale; exponentiating the
    # posterior credible-interval bounds directly would be more exact
    `Lower 95% CI` = exp(estimate - 1.96*std.error),
    `Upper 95% CI` = exp(estimate + 1.96*std.error)
  ) %>%
  select(term, estimate, std.error, OR, `Lower 95% CI`, `Upper 95% CI`)

# Display table
kbl(sum_bayes, digits = 3, booktabs = TRUE, caption = "Bayesian Logistic Regression Results (brms)") %>%
  kable_classic(full_width = F)
Bayesian Logistic Regression Results (brms)

| term | estimate | std.error | OR | Lower 95% CI | Upper 95% CI |
|------|---------:|----------:|---:|-------------:|-------------:|
| (Intercept) | -0.769 | 0.160 | 0.463 | 0.339 | 0.634 |
| Age | -0.005 | 0.001 | 0.995 | 0.993 | 0.997 |
| BMINormalweight | 0.029 | 0.152 | 1.030 | 0.765 | 1.387 |
| BMIOverweight | 0.077 | 0.157 | 1.080 | 0.793 | 1.470 |
| BMIObese | 0.027 | 0.156 | 1.027 | 0.756 | 1.396 |
| GenderFemale | -0.092 | 0.048 | 0.912 | 0.831 | 1.001 |
| RaceOtherHispanic | -0.154 | 0.092 | 0.857 | 0.716 | 1.027 |
| RaceNonMHispanicWhite | -0.213 | 0.067 | 0.808 | 0.709 | 0.922 |
| RaceNonMHispanicBlack | -0.056 | 0.071 | 0.946 | 0.823 | 1.087 |
| RaceOtherRaceMIncludingMultiMRacial | -0.111 | 0.078 | 0.895 | 0.768 | 1.043 |
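To make the posterior summaries concrete: a coefficient on the log-odds scale converts to an odds ratio by exponentiation, and the intercept converts to a baseline probability via the inverse logit. A minimal base-R check using the posterior means from the table above (note the intercept corresponds to the reference categories with Age = 0, so it is an extrapolation unless Age was centered):

```r
# Posterior mean log-odds taken from the brms table above
b_intercept <- -0.769
b_female    <- -0.092

exp(b_female)        # odds ratio for Female vs Male, ~0.91
plogis(b_intercept)  # inverse-logit baseline probability, ~0.32
```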

References

Ali, Sakhawat, Rizwana Hussain, Rohaib A Malik, Raheema Amin, and Muhammad N Tariq. 2024. "Association of Obesity With Type 2 Diabetes Mellitus: A Hospital-Based Unmatched Case-Control Study." Cureus 16 (1): 1–7. https://doi.org/10.7759/cureus.52728.
Buuren, Stef van, and Karin Groothuis-Oudshoorn. 2011. "mice: Multivariate Imputation by Chained Equations in R." Journal of Statistical Software 45 (3): 1–67. https://doi.org/10.18637/jss.v045.i03.
Chatzimichail, Theodora, and Aristides T. Hatjimihail. 2023. "A Bayesian Inference Based Computational Tool for Parametric and Nonparametric Medical Diagnosis." Diagnostics 13 (19). https://doi.org/10.3390/diagnostics13193135.
D'Angelo, Gina, and Di Ran. 2025. "Tutorial on Firth's Logistic Regression Models for Biomarkers in Preclinical Space." Pharmaceutical Statistics 24 (1): 1–8. https://doi.org/10.1002/pst.2422.
Gosho, Masahiko, Ryota Ishii, Kengo Nagashima, Hisashi Noma, and Kazushi Maruo. 2025. "Determining the Prior Mean in Bayesian Logistic Regression with Sparse Data: A Nonarbitrary Approach." Journal of the Royal Statistical Society, Series C: Applied Statistics 74 (1): 126–41. https://doi.org/10.1093/jrsssc/qlae048.
Klauenberg, Katy, Gerd Wübbeler, Bodo Mickan, Peter Harris, and Clemens Elster. 2015. "A Tutorial on Bayesian Normal Linear Regression." Metrologia 52 (6): 878–92. https://doi.org/10.1088/0026-1394/52/6/878.
National Center for Health Statistics. 2013. "National Health and Nutrition Examination Survey: Analytic Guidelines, 1999–2010." Vital and Health Statistics, Series 2, Number 161. https://wwwn.cdc.gov/nchs/data/series/sr02_161.pdf.
Schoot, Rens van de, Sarah Depaoli, Ruth King, Bianca Kramer, Kaspar Märtens, Mahlet G. Tadesse, Marina Vannucci, et al. 2021. "Bayesian Statistics and Modelling." Nature Reviews Methods Primers 1 (1): 1. https://doi.org/10.1038/s43586-020-00001-2.
Van Buuren, Stef. 2012. Flexible Imputation of Missing Data. Vol. 10. Boca Raton, FL: CRC Press.